Improving Mispronunciation Detection of Mandarin Tones for Non-Native Learners With Soft-Target Tone Labels and BLSTM-Based Deep Tone Models

We investigate the effectiveness of soft-target tone labels and sequential context information for detecting mispronunciations of Mandarin lexical tones produced by second language (L2) learners whose first language (L1) is of European origin. In conventional approaches, prosodic information (e.g., F0 and tone posteriors extracted from trained tone models) is used to calculate goodness of pronunciation (GOP) scores or to train binary classifiers that verify pronunciation correctness. We propose three techniques to improve tone mispronunciation detection for non-native learners. First, we extend our tone model from a deep neural network (DNN) to a bidirectional long short-term memory (BLSTM) network in order to model more accurately the high variability of non-native tone productions and the contextual information expressed in tone-level co-articulation. Second, we characterize ambiguous pronunciations, in which an L2 learner’s tone realization falls between two canonical tone categories, by relaxing hard target labels to soft targets with probabilistic transcriptions. Third, the segmental tone features fed into the verifiers are extracted by a BLSTM to exploit sequential context information. Compared to DNN-GOP trained with hard targets, the proposed BLSTM-GOP framework trained with soft targets reduces the equal error rate (EER), averaged over tones, from 7.58% to 5.83% and increases the averaged area under the ROC curve (AUC) from 97.85% to 98.31%. With BLSTM-based verifiers, the EER decreases further to 5.16% and the AUC increases to 98.47%.
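The difference between hard and soft tone targets can be illustrated with a minimal sketch. It is not the authors' implementation: the annotator-vote scheme used to build the probabilistic label, the function names, and the example logits are all assumptions made for illustration, and only the four main lexical tones are modeled.

```python
import math

# Four lexical tones; the neutral tone is omitted for brevity (an assumption).
TONES = ("T1", "T2", "T3", "T4")


def hard_target(tone):
    """One-hot label: all probability mass on a single canonical tone."""
    return [1.0 if t == tone else 0.0 for t in TONES]


def soft_target(votes):
    """Probabilistic label built from hypothetical annotator votes,
    e.g. {"T2": 2, "T3": 1} for a realization between tone 2 and tone 3."""
    total = sum(votes.values())
    return [votes.get(t, 0) / total for t in TONES]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted) if t > 0.0)


# Illustrative tone-model output for one syllable (made-up logits).
posterior = softmax([0.1, 1.2, 1.0, -0.5])

hard = hard_target("T2")                # forces the label "tone 2"
soft = soft_target({"T2": 2, "T3": 1})  # keeps the T2/T3 ambiguity

loss_hard = cross_entropy(hard, posterior)
loss_soft = cross_entropy(soft, posterior)
print(f"hard-target loss: {loss_hard:.4f}")
print(f"soft-target loss: {loss_soft:.4f}")
```

With a soft target, the training loss rewards a model that spreads its posterior across the two plausible tones, rather than penalizing any mass placed off the single annotated category.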
