Pronunciation augmentation for Mandarin-English code-switching speech recognition

Code-switching (CS) is the use of more than one language within a single utterance. It poses a great challenge to automatic speech recognition (ASR) for three reasons: the language switches within one utterance, the words of the embedded language show pronunciation variation, and code-switching training data is severely sparse. This paper focuses on the Mandarin-English CS ASR task. We aim to handle the pronunciation variation and alleviate the sparsity of code-switches by using pronunciation augmentation methods. An English-to-Mandarin mixed-language phone mapping approach is first proposed to obtain a language-universal CS lexicon. Based on this lexicon, an acoustic data-driven lexicon learning framework is further proposed to learn new pronunciations that cover the accents, mispronunciations, or pronunciation variations of the embedded English words. Experiments are performed on real CS ASR tasks. The effectiveness of the proposed methods is examined on conventional, hybrid, and recent end-to-end speech recognition systems. Experimental results show that both the learned phone mapping and the augmented pronunciations can significantly improve the performance of code-switching speech recognition.
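
To make the idea of a language-universal lexicon concrete, the following is a minimal sketch of how an English-to-Mandarin phone map could be applied to English pronunciations so that both languages share one phone inventory. Everything in it is an assumption made for illustration: the map entries, the phone symbols, and the function names `map_pronunciation` and `build_universal_lexicon` are hypothetical, and the hand-written table stands in for the mapping that the paper learns.

```python
# Hypothetical English-to-Mandarin phone map. The entries are illustrative
# placeholders, not the mapping learned in the paper.
EN_TO_ZH_PHONE_MAP = {
    "HH": ["h"],
    "AH": ["e"],
    "L": ["l"],
    "OW": ["ou"],
}


def map_pronunciation(english_phones, phone_map):
    """Rewrite one English phone sequence using Mandarin phones."""
    mapped = []
    for phone in english_phones:
        if phone not in phone_map:
            raise KeyError(f"no mapping for English phone {phone!r}")
        # One English phone may map to one or more Mandarin phones.
        mapped.extend(phone_map[phone])
    return mapped


def build_universal_lexicon(english_lexicon, mandarin_lexicon, phone_map):
    """Merge both lexicons into one that uses only Mandarin phones.

    Each lexicon maps a word to a list of pronunciations, and each
    pronunciation is a list of phone symbols.
    """
    lexicon = {word: [list(p) for p in prons]
               for word, prons in mandarin_lexicon.items()}
    for word, prons in english_lexicon.items():
        lexicon[word] = [map_pronunciation(p, phone_map) for p in prons]
    return lexicon


english_lexicon = {"hello": [["HH", "AH", "L", "OW"]]}
mandarin_lexicon = {"你好": [["n", "i", "h", "ao"]]}
universal = build_universal_lexicon(
    english_lexicon, mandarin_lexicon, EN_TO_ZH_PHONE_MAP
)
```

In this sketch, pronunciation augmentation would amount to appending further phone sequences to the list stored for an English word, which is why each word maps to a list of pronunciations rather than a single one.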
