A study on robust detection of pronunciation erroneous tendency based on deep neural network

Learners using computer-aided pronunciation training (CAPT) systems have greater demand for instructive feedback than for scoring feedback, and instructive feedback requires detailed information about erroneous pronunciations in addition to phone errors. Pronunciation erroneous tendency (PET) defines, for each phone, a set of incorrect articulation configurations in terms of the main articulators and the manner of articulation; robust PET detection therefore helps a system provide appropriate instructive feedback. In our previous work, we designed a set of PET labels for Chinese as a second language (CSL) spoken by Japanese learners and conducted a preliminary detection study with GMM-HMM. This study aims at more robust PET detection through two approaches: employing DNN-HMM for acoustic modeling, and comparing three kinds of acoustic features: MFCC, PLP, and filter-bank. Experimental results showed that DNN-HMM PET modeling achieved more robust detection accuracy than the previous GMM-HMM, and that the three kinds of features behaved differently. A lattice combination of the results of the three feature systems gave the best PET detection results: FRR of 5.5%, FAR of 35.6%, and DA of 88.6%, which demonstrates the effectiveness of the combination.
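The abstract reports FRR, FAR, and DA without defining them. As a minimal sketch, assuming the definitions commonly used in mispronunciation detection (false rejection rate over correctly pronounced phones, false acceptance rate over phones with a PET, and accuracy over all phones), the three figures could be computed from confusion counts as follows. The `DetectionCounts` class, the `detection_metrics` function, and the example counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DetectionCounts:
    """Phone-level confusion counts (hypothetical container, not from the paper)."""
    true_accept: int   # correct pronunciation, accepted as correct
    false_reject: int  # correct pronunciation, wrongly flagged as a PET
    false_accept: int  # PET present, wrongly accepted as correct
    true_reject: int   # PET present, flagged as a PET


def detection_metrics(c: DetectionCounts) -> dict:
    """Compute FRR, FAR, and DA under the assumed (common) definitions."""
    n_correct = c.true_accept + c.false_reject  # all correctly pronounced phones
    n_error = c.false_accept + c.true_reject    # all phones with a PET
    if n_correct == 0 or n_error == 0:
        raise ValueError("need at least one correct phone and one erroneous phone")
    return {
        "FRR": c.false_reject / n_correct,
        "FAR": c.false_accept / n_error,
        "DA": (c.true_accept + c.true_reject) / (n_correct + n_error),
    }


# Illustrative counts only; these are not the paper's data.
example = DetectionCounts(true_accept=90, false_reject=10,
                          false_accept=4, true_reject=16)
metrics = detection_metrics(example)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under these definitions, a low FRR alongside a much higher FAR, as in the reported results, would indicate a system that seldom flags correct pronunciations but still misses a sizeable share of actual errors.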
