Acoustic data augmentation for Mandarin-English code-switching speech recognition

Abstract. Code-switching (CS) is a multilingual phenomenon in which a speaker alternates between languages within an utterance or between consecutive utterances. Developing large-scale datasets for training code-switching acoustic and language models is challenging and extremely expensive. In this paper, we focus on acoustic data augmentation for the Mandarin-English CS speech recognition task. The effectiveness of conventional acoustic data augmentation approaches is examined. More importantly, we propose a CS acoustic event detection system based on a deep neural network to extract real code-switching speech segments automatically. Semi-supervised and active learning techniques are then investigated to generate transcriptions of these segments. Finally, a code-switching speech synthesis system is introduced to further enhance acoustic modeling. Experimental results on the OC16-CE80 data, a Mandarin-English mixlingual speech corpus, demonstrate the effectiveness of the proposed methods.
