A Recurrent Neural Network-Based Approach to Automatic Language Identification from Speech

Automatic language identification is the task of determining which language is being spoken in a speech signal. It is an important step before speech recognition in multilingual scenarios, where speakers use more than one language in the course of communication. In this paper, a recurrent neural network (RNN)-based system with long short-term memory (LSTM), combined with handcrafted line spectral frequency-based features, is proposed for language identification. Experiments were performed on 21,908 clips (more than 30 hours of data) from the three most widely spoken languages in the world, namely English, Chinese, and Spanish, and the highest average accuracy obtained was 95.22%.
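As a rough illustration of the kind of pipeline described above, the following minimal sketch runs a single LSTM layer over a sequence of per-frame feature vectors and maps the final hidden state to a probability for each of the three languages. It is not the authors' implementation: the function names (init_params, lstm_classify), the feature dimension, the hidden size, and the randomly initialised, untrained weights are all assumptions made for illustration, and the input is assumed to be a precomputed matrix of line spectral frequency-based features, one row per frame.

```python
import numpy as np

LANGUAGES = ["English", "Chinese", "Spanish"]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(n_features, n_hidden, n_classes, seed=0):
    """Randomly initialise LSTM and output-layer weights (untrained, illustrative)."""
    rng = np.random.default_rng(seed)
    return {
        # Input, recurrent, and bias terms for the four gates, stacked row-wise.
        "W": rng.normal(0.0, 0.1, (4 * n_hidden, n_features)),
        "U": rng.normal(0.0, 0.1, (4 * n_hidden, n_hidden)),
        "b": np.zeros(4 * n_hidden),
        # Linear layer from the final hidden state to the class scores.
        "V": rng.normal(0.0, 0.1, (n_classes, n_hidden)),
        "c": np.zeros(n_classes),
    }


def lstm_classify(frames, params):
    """Run an LSTM over `frames` (n_frames x n_features) and return class probabilities."""
    n_hidden = params["U"].shape[1]
    h = np.zeros(n_hidden)     # hidden state
    cell = np.zeros(n_hidden)  # cell (memory) state
    for x in frames:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
        g = np.tanh(g)                                # candidate cell update
        cell = f * cell + i * g
        h = o * np.tanh(cell)
    logits = params["V"] @ h + params["c"]
    logits = logits - logits.max()  # shift for a numerically stable softmax
    return np.exp(logits) / np.exp(logits).sum()


if __name__ == "__main__":
    # Hypothetical input: 50 frames of 10 sorted values in (0, pi),
    # standing in for line spectral frequencies of one clip.
    rng = np.random.default_rng(1)
    frames = np.sort(rng.uniform(0.0, np.pi, (50, 10)), axis=1)
    probs = lstm_classify(frames, init_params(10, 16, len(LANGUAGES)))
    print(dict(zip(LANGUAGES, probs.round(3))))
```

With untrained weights the printed probabilities are close to uniform; in a real system the parameters would be learned from labelled clips, and the feature extraction and network configuration would follow the paper rather than the values assumed here.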
