Phone Merging For Code-Switched Speech Recognition

Speakers in multilingual communities often switch between or mix multiple languages in the same conversation. Automatic Speech Recognition (ASR) of code-switched speech faces many challenges, including the influence that the phones of different languages have on each other. This paper presents evidence that sharing phones between languages improves acoustic model performance for Hindi-English code-switched speech. We compare a baseline system built with separate phones for Hindi and English against systems in which the phones were manually merged based on linguistic knowledge. Encouraged by the improved ASR performance after manual merging, we further investigate multiple data-driven methods for identifying phones to merge across the two languages. We present a detailed analysis of automatic phone merging for this language pair and of its impact on individual phone accuracies and word error rate (WER). Although the best performance gain, 1.2% WER, was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.
