SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation

The lack of labeled second language (L2) speech data is a major challenge in designing mispronunciation detection models. We introduce SpeechBlender, a fine-grained data augmentation pipeline that generates mispronunciation errors to overcome this data scarcity. SpeechBlender uses a variety of masks to target different regions of a phonetic unit, and uses mixing factors to linearly interpolate raw speech signals while augmenting pronunciation. The masks facilitate smooth blending of the signals, generating more effective samples than the 'Cut/Paste' method. Our proposed technique achieves state-of-the-art results on Speechocean762 for ASR-dependent mispronunciation detection models at the phoneme level, with a 2.0% gain in Pearson Correlation Coefficient (PCC) compared to the previous state-of-the-art [1]. Additionally, we demonstrate a 5.0% improvement at the phoneme level compared to our baseline. We also observe a 4.6% increase in F1-score on the Arabic AraVoiceL2 test set.
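The mask-and-mixing-factor idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `blend_segments`, the parameters `lam` and `smooth_frac`, the trapezoidal mask shape, and the requirement that both segments are already length-matched are all assumptions made for illustration only.

```python
import numpy as np


def blend_segments(good, bad, lam=0.5, smooth_frac=0.25):
    """Linearly interpolate two equal-length 1-D waveform segments.

    Hypothetical illustration only. `good` is the original phoneme segment,
    `bad` is a donor segment, `lam` is the mixing factor applied to the donor
    in the centre of the segment, and `smooth_frac` is the fraction of the
    segment used for each edge ramp of the mask.
    """
    good = np.asarray(good, dtype=np.float64)
    bad = np.asarray(bad, dtype=np.float64)
    if good.ndim != 1 or good.shape != bad.shape:
        raise ValueError("segments must be 1-D and have the same length")
    if not 0.0 <= lam <= 1.0 or not 0.0 <= smooth_frac <= 0.5:
        raise ValueError("lam must be in [0, 1] and smooth_frac in [0, 0.5]")

    n = good.size
    ramp_len = int(n * smooth_frac)

    # Mask = per-sample weight of the donor signal: lam in the centre,
    # ramping linearly from 0 at both edges so the boundaries stay smooth.
    mask = np.full(n, lam)
    if ramp_len > 0:
        ramp = np.linspace(0.0, lam, ramp_len, endpoint=False)
        mask[:ramp_len] = ramp
        mask[n - ramp_len:] = ramp[::-1]

    # Sample-wise linear interpolation of the raw signals.
    return (1.0 - mask) * good + mask * bad


if __name__ == "__main__":
    good = np.zeros(8)
    bad = np.ones(8)
    print(blend_segments(good, bad, lam=0.5, smooth_frac=0.25))
```

In this sketch, setting `lam=1.0` with `smooth_frac=0.0` reduces to replacing the segment outright, which is the hard-boundary behaviour that a Cut/Paste-style substitution would produce; smaller `lam` values and non-zero ramps give partial, smoothly blended substitutions.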

[1] Yuehai Wang, et al. End-to-end Mispronunciation Detection with Simulated Error Distance, 2022, INTERSPEECH.

[2] Ashwinkumar Ganesan, et al. L2-GEN: A Neural Phoneme Paraphrasing Approach to L2 Speech Synthesis for Mispronunciation Diagnosis, 2022, INTERSPEECH.

[3] Tien-Hong Lo, et al. 3M: An Effective Multi-view, Multi-granularity, and Multi-aspect Modeling Approach to English Pronunciation Assessment, 2022, 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC).

[4] M. Alsulaiman, et al. Mispronunciation Detection and Diagnosis with Articulatory-Level Feedback Generation for Non-Native Arabic Speech, 2022, Mathematics.

[5] James R. Glass, et al. Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment, 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] Xiaohai Tian, et al. Improving Non-native Word-level Pronunciation Scoring with Phone-level Mixup Data Augmentation and Multi-source Information, 2022, ArXiv.

[7] Xiaoshuo Xu, et al. Explore wav2vec 2.0 for Mispronunciation Detection, 2021, Interspeech.

[8] Liyuan Wang, et al. Deep Feature Transfer Learning for Automatic Pronunciation Assessment, 2021, Interspeech.

[9] Jinsong Zhang, et al. A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques, 2021, ArXiv.

[10] Daniel Povey, et al. speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment, 2021, Interspeech.

[11] P. Rogerson-Revell. Computer-Assisted Pronunciation Training (CAPT): Current Issues and Future Directions, 2021.

[12] Jinsong Zhang, et al. Automatic Scoring at Multi-Granularity for L2 Pronunciation, 2020, INTERSPEECH.

[13] Qin Jin, et al. Context-aware Goodness of Pronunciation for Computer-Assisted Pronunciation Training, 2020, INTERSPEECH.

[14] Kai Chen, et al. SED-MDD: Towards Sentence Dependent End-To-End Mispronunciation Detection and Diagnosis, 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Chiranjeevi Yarra, et al. An Improved Goodness of Pronunciation (GoP) Measure for Pronunciation Evaluation with DNN-HMM System Considering HMM Transition Probabilities, 2019, INTERSPEECH.

[16] Xunying Liu, et al. CNN-RNN-CTC Based End-to-end Mispronunciation Detection and Diagnosis, 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17] Ricardo Gutierrez-Osuna, et al. L2-ARCTIC: A Non-native English Speech Corpus, 2018, INTERSPEECH.

[18] James R. Glass, et al. The MGB-2 challenge: Arabic multi-dialect broadcast media recognition, 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[19] Diego Giuliani, et al. The effectiveness of computer assisted pronunciation training for foreign language learning by children, 2008.

[20] Steve J. Young, et al. Phone-level pronunciation scoring and assessment for interactive language learning, 2000, Speech Commun.

[21] S. Hochreiter, et al. Long Short-Term Memory, 1997, Neural Computation.

[22] Frank K. Soong, et al. A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL), 2013, INTERSPEECH.

[23] Daniel Povey, et al. The Kaldi Speech Recognition Toolkit, 2011.

[24] J. Flege. Second Language Speech Learning: Theory, Findings, and Problems, 2006.

[25] W. Strange. Speech perception and linguistic experience: issues in cross-language research, 1995.