Spoken Language Identification with a Deep Temporal Neural Network and Multi-level Discriminative Cues

Language cues are an important component of spoken language identification (LID), but aligning such cues to speech segments through manual annotation by professional linguists is very time-consuming. Instead of annotating linguistic phonemes, we exploit co-occurrence statistics in speech utterances to discover the underlying phoneme-like speech units in an unsupervised manner. We then model phonotactic constraints over this set of phoneme-like units to find larger speech segments, called suprasegmental phonemes, and extract multi-level language cues from them, including phonetic, phonotactic, and prosodic cues. Furthermore, a novel LID system is proposed based on a TDNN followed by an LSTM-RNN. The proposed system is built and compared with acoustic-feature-based and phonetic-feature-based methods on the NIST LRE07 and Arabic dialect identification tasks. Experimental results show that our LID system captures robust discriminative information for short-duration language identification and achieves high accuracy for dialect identification.
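The TDNN-followed-by-LSTM architecture described above can be sketched in miniature as a time-delay (frame-splicing) layer feeding a recurrent layer whose final hidden state summarizes the utterance for a softmax language classifier. The sketch below is a minimal NumPy illustration, not the paper's actual model: all layer sizes, the context offsets, and the random "MFCC-like" input are illustrative assumptions.

```python
import numpy as np

def tdnn_layer(x, w, b, context):
    # x: (T, D) frames. A TDNN layer splices frames at the given
    # temporal context offsets, then applies an affine map + ReLU.
    T = x.shape[0]
    lo, hi = min(context), max(context)
    out = []
    for t in range(-lo, T - hi):
        spliced = np.concatenate([x[t + c] for c in context])
        out.append(np.maximum(w @ spliced + b, 0.0))
    return np.stack(out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_state(x, wx, wh, b, h0, c0):
    # Single-layer LSTM over time; gates packed as [i, f, g, o].
    # Returns the last hidden state as an utterance-level summary.
    h, c = h0, c0
    H = h0.shape[0]
    for xt in x:
        z = wx @ xt + wh @ h + b
        i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

rng = np.random.default_rng(0)
T, D, H, L = 50, 13, 32, 4          # frames, feature dim, hidden size, languages
x = rng.standard_normal((T, D))     # stand-in for MFCC-like acoustic features

# TDNN layer with (illustrative) context offsets {-2, 0, +2}
ctx = (-2, 0, 2)
w1 = rng.standard_normal((H, D * len(ctx))) * 0.1
feats = tdnn_layer(x, w1, np.zeros(H), ctx)

# LSTM over the TDNN outputs, then a softmax language classifier
wx = rng.standard_normal((4 * H, H)) * 0.1
wh = rng.standard_normal((4 * H, H)) * 0.1
h = lstm_last_state(feats, wx, wh, np.zeros(4 * H), np.zeros(H), np.zeros(H))

wo = rng.standard_normal((L, H)) * 0.1
logits = wo @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()                # posterior over the L candidate languages
```

In a real system the TDNN stack would be several layers deep with growing contexts, and training would use cross-entropy over labeled utterances; the point here is only the data flow: frame splicing → recurrence → utterance-level decision.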
