Weakly-supervised word-level pronunciation error detection in non-native English speech

We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. Training this model does not require phonetically transcribed L2 speech; only the mispronounced words need to be marked. Without phonetic transcriptions of L2 speech, the model has to learn from the weak signal of word-level mispronunciation labels alone. This, together with the limited amount of mispronounced L2 speech, makes the model prone to overfitting. To limit this risk, we train it in a multi-task setup. In the first task, we estimate the probability that each word is mispronounced. For the second task, we use a phoneme recognizer trained on phonetically transcribed L1 speech, which is easily accessible and can be annotated automatically. Compared to state-of-the-art approaches, we improve the accuracy of detecting word-level pronunciation errors, measured by AUC, by 30% on the GUT Isle Corpus of L2 Polish speakers and by 21.5% on the Isle Corpus of L2 German and Italian speakers.
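The multi-task setup described above can be illustrated with a minimal sketch of a combined training objective. This is not the authors' implementation: the abstract does not state how the two tasks are combined, so the weighted sum, the `phoneme_weight` parameter, and the use of a simple per-phoneme cross-entropy as a stand-in for the phoneme recognizer's loss are all assumptions made for illustration, as are the function names.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over words.

    probs  -- predicted probability that each word is mispronounced
    labels -- 1 if the word was marked as mispronounced, 0 otherwise
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, EPS), 1.0 - EPS)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def phoneme_cross_entropy(dists, targets):
    """Mean negative log-likelihood of the reference phonemes.

    dists   -- one probability distribution over phonemes per position
    targets -- index of the reference phoneme at each position
    Assumption: a stand-in for whatever loss the phoneme recognizer uses.
    """
    total = 0.0
    for dist, t in zip(dists, targets):
        total += -math.log(max(dist[t], EPS))
    return total / len(targets)


def multitask_loss(word_probs, word_labels, phoneme_dists, phoneme_targets,
                   phoneme_weight=0.5):
    """Hypothetical combined objective: word-level detection loss plus a
    weighted phoneme recognition loss acting as a regularizer."""
    word_loss = binary_cross_entropy(word_probs, word_labels)
    phoneme_loss = phoneme_cross_entropy(phoneme_dists, phoneme_targets)
    return word_loss + phoneme_weight * phoneme_loss


# Toy batch: two words (first mispronounced), three phoneme positions.
loss = multitask_loss(
    word_probs=[0.8, 0.1],
    word_labels=[1, 0],
    phoneme_dists=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
    phoneme_targets=[0, 1, 2],
)
print(f"combined loss: {loss:.4f}")
```

In a sketch like this, the word-level term would be computed on L2 speech carrying only word-level marks, while the phoneme term would be computed on phonetically transcribed L1 speech, so the second task supplies a denser training signal that may help limit overfitting to the weak labels.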
