A DNN-based acoustic modeling of tonal language and its application to Mandarin pronunciation training

In this paper we investigate a Deep Neural Network (DNN) based approach to acoustic modeling of tonal languages and assess its speech recognition performance with different features and modeling techniques. Mandarin Chinese, the most widely spoken tonal language, is chosen for testing tone-related ASR performance. Furthermore, the DNN-trained, tone-sensitive model is evaluated on automatic detection of mispronunciations by L2 Mandarin learners. The best DNN-HMM acoustic model of tonal syllables (initial and tonal final), trained with embedded F0 features, outperforms the baseline DNN system trained on 39 MFCC features: the tone error rate is reduced by a relative 32% and 35%, and the tonal syllable error rate by a relative 20% and 23%, for female and male speakers, respectively. On a speech database of L2 Mandarin learners (native speakers of European languages), our DNN-HMM system lowers the equal error rate of mispronunciation detection by 2% absolute, from 27.5% to 25.5%, compared with the baseline system.
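
The abstract reports two kinds of figures: relative error rate reductions (for ASR) and an equal error rate (for mispronunciation detection). The following is a minimal sketch of how such quantities are commonly computed, not the authors' evaluation code. The function names, the toy scores, and the convention that a higher score means "more likely mispronounced" are all assumptions made for illustration.

```python
def relative_reduction(baseline, proposed):
    """Relative error rate reduction, in percent of the baseline error rate."""
    return 100.0 * (baseline - proposed) / baseline


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    Assumption: labels[i] is True for a mispronounced token, and a higher
    score means the detector considers the token more likely mispronounced.
    """
    pos = [s for s, is_mis in zip(scores, labels) if is_mis]
    neg = [s for s, is_mis in zip(scores, labels) if not is_mis]
    best_gap, best_eer = None, None
    for t in sorted(set(scores)):
        frr = sum(s < t for s in pos) / len(pos)    # missed mispronunciations
        far = sum(s >= t for s in neg) / len(neg)   # false alarms
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Take the operating point where the two error rates are closest.
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# EER figures quoted in the abstract: 27.5% (baseline) vs. 25.5% (DNN-HMM).
absolute_gain = 27.5 - 25.5
relative_gain = relative_reduction(27.5, 25.5)

# Toy detector scores (hypothetical, not from the paper's database).
toy_eer = equal_error_rate([0.9, 0.8, 0.4, 0.3], [True, False, True, False])
```

Under these definitions, the 2% drop in equal error rate quoted above is an absolute difference; expressed relative to the 27.5% baseline it is roughly 7.3%.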
