Improving Mandarin Tone Mispronunciation Detection for Non-Native Learners with Soft-Target Tone Labels and BLSTM-Based Deep Models

We propose three techniques to improve mispronunciation detection of Mandarin tones produced by second language (L2) learners using a tone-based extended recognition network (ERN). First, we extend our acoustic model from a deep neural network (DNN) to a bidirectional long short-term memory (BLSTM) network in order to model tone-level co-articulation, which is influenced by a broader temporal context (e.g., two or three consecutive Mandarin syllables). Second, we relax the hard labels to handle cases in which a single tone class label is not enough, because L2 learners' pronunciations often fall between two canonical tone categories. We therefore propose soft targets (a probabilistic transcription) for acoustic model training in place of conventional hard targets (one-hot targets). Third, we average the tone scores produced by BLSTM models trained with hard and soft targets to exploit the complementarity between the two kinds of tone targets. Compared to our previous system based on DNN-trained ERNs, the BLSTM-trained system with soft targets reduces the equal error rate (EER) from 5.77% to 4.86%, and system combination decreases the EER further to 4.34%, a 24.78% relative error reduction.
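The difference between hard and soft targets, and the score averaging used for system combination, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration rather than a detail from the paper: the four-tone inventory, the 0.6/0.4 soft label, the logits, and the helper names `softmax`, `cross_entropy`, `average_scores`, and `relative_reduction` are all hypothetical. Only the EER figures in the last line come from the abstract.

```python
import math

# Assumed four-tone inventory, used only for illustration.
TONES = ["T1", "T2", "T3", "T4"]


def softmax(logits):
    """Convert raw model outputs into a posterior over tone classes."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, posterior):
    """Cross-entropy between a target distribution and a model posterior.

    With a one-hot target this reduces to -log p(correct tone); with a
    soft target every tone that carries probability mass contributes.
    """
    return -sum(t * math.log(p) for t, p in zip(target, posterior) if t > 0)


def average_scores(p_hard, p_soft):
    """Combine two systems by averaging their tone posteriors."""
    return [(a + b) / 2 for a, b in zip(p_hard, p_soft)]


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical labels for one syllable whose tone sounds between T2 and T3.
hard_target = [0.0, 1.0, 0.0, 0.0]  # one-hot: T2 only
soft_target = [0.0, 0.6, 0.4, 0.0]  # probabilistic: mostly T2, partly T3

# Hypothetical outputs of a hard-target model and a soft-target model.
posterior_hard = softmax([0.1, 2.0, 1.5, -1.0])
posterior_soft = softmax([0.0, 1.2, 1.6, -0.5])

loss_hard = cross_entropy(hard_target, posterior_hard)
loss_soft = cross_entropy(soft_target, posterior_hard)

combined = average_scores(posterior_hard, posterior_soft)

# EER values reported in the abstract: 5.77% baseline, 4.34% combined.
reduction_percent = round(relative_reduction(5.77, 4.34) * 100, 2)
```

Averaging posteriors is the simplest combination rule and keeps the result a valid probability distribution. The final line reproduces the reported figure: (5.77 - 4.34) / 5.77 is approximately 24.78%.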
