Releasing a Toolkit and Comparing the Performance of Language Embeddings Across Various Spoken Language Identification Datasets

In this paper, we present a software toolkit for easier end-to-end training of deep-learning-based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification experiments. All models are trained separately on each dataset (closed task) and on a combination of all datasets (open task), after which we compare whether open-task training yields better language embeddings. We begin by training all models end-to-end as discriminative classifiers of spectral features labeled by language. We then extract language embedding vectors from the trained end-to-end models, train separate Gaussian Naive Bayes classifiers on these vectors, and compare which model provides the best language embeddings for the back-end classifier. Our experiments show that the open-task condition improves language identification performance on only one of the datasets. In addition, we find that increasing x-vector model robustness with random frequency channel dropout significantly reduces its end-to-end classification performance on the test set, while leaving the back-end classification performance of its embeddings unaffected. Finally, we note that two of the baseline models consistently outperformed all other models.
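As a rough illustration of the back-end step described above, the sketch below trains a Gaussian Naive Bayes classifier on fixed-length language embedding vectors with scikit-learn. The randomly generated embeddings, their dimensionality, and the per-language class offsets are placeholder assumptions standing in for vectors extracted from a trained end-to-end model; only the Gaussian Naive Bayes back-end itself is taken from the text.

```python
# Minimal sketch of the back-end classification stage, assuming
# synthetic stand-in embeddings (the real ones would be extracted
# from the trained end-to-end model).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_utts, emb_dim, n_langs = 600, 512, 3  # assumed sizes for illustration

# Placeholder data: one embedding vector per utterance, with a small
# per-language offset so the classes are separable.
labels = rng.integers(0, n_langs, size=n_utts)
embeddings = rng.normal(size=(n_utts, emb_dim)) + labels[:, None]

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

# Separate Gaussian Naive Bayes back-end trained on the embeddings.
backend = GaussianNB()
backend.fit(X_tr, y_tr)
print(f"back-end accuracy: {accuracy_score(y_te, backend.predict(X_te)):.3f}")
```

In the setting described above, the embedding matrix would be extracted from the trained end-to-end models rather than generated randomly, but the back-end training and scoring steps stay the same for each model being compared.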
