Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages

State-of-the-art spoken language identification (LID) systems, based on end-to-end deep neural networks, have shown remarkable success in discriminating not only between distant languages but also between closely related languages, and even between spoken varieties of the same language. However, it remains unclear to what extent neural LID models generalize to speech samples with different acoustic conditions, i.e., under domain shift. In this paper, we present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems for a subset of six Slavic languages across two domains (read speech and radio broadcast), and we examine two low-level signal descriptors (spectral and cepstral features) for this task. Our experiments show that (1) out-of-domain speech samples severely degrade the performance of neural LID models, and (2) while spectral and cepstral features perform comparably in-domain, spectral features are more robust under domain mismatch. Moreover, we apply unsupervised domain adaptation to minimize the discrepancy between the two domains in our study. We achieve relative accuracy improvements ranging from 9% to 77%, depending on the diversity of acoustic conditions in the source domain.
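The distinction between the two low-level descriptor families compared in the abstract can be sketched as follows. This is a minimal, hypothetical illustration (not the paper's actual feature pipeline): a spectral descriptor is taken here to be the log-magnitude spectrum of a windowed frame, and a cepstral descriptor the real cepstrum, i.e., the inverse transform of that log spectrum, whose low-order coefficients summarize the spectral envelope. The synthetic two-tone frame stands in for 25 ms of 16 kHz speech.

```python
import numpy as np

# Synthetic stand-in for one 25 ms speech frame at 16 kHz (hypothetical signal).
sr = 16000
t = np.arange(int(0.025 * sr)) / sr  # 400 samples
frame = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

# Spectral descriptor: log-magnitude spectrum of the Hann-windowed frame.
windowed = frame * np.hanning(len(frame))
log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)

# Cepstral descriptor: real cepstrum = inverse FFT of the log spectrum.
# Low-order coefficients capture the smooth spectral envelope, discarding
# fine harmonic structure; 13 is a conventional (assumed) truncation.
cepstrum = np.fft.irfft(log_spectrum, n=len(windowed))
cepstral_coeffs = cepstrum[:13]

print(log_spectrum.shape, cepstral_coeffs.shape)
```

In a full LID front end, the spectral branch would typically be mel-warped (a log-mel spectrogram) and the cepstral branch would add a DCT to yield MFCCs, but the contrast above, spectrum versus its inverse-transformed log, is the core of the distinction the paper's comparison rests on.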
