DNN based detection of pronunciation erroneous tendency in data sparse condition

Detecting pronunciation erroneous tendencies (PETs) can provide second language learners with detailed, instructive feedback in computer-aided pronunciation training (CAPT) systems. Due to data sparseness, DNN-HMM achieved only limited improvement over GMM-HMM in our previous work. Instead of directly employing a DNN-HMM to detect PETs, this paper investigates how to further improve performance through DNN-based feature extraction under data-sparse conditions. First, posterior probabilities of articulatory features derived from the top layer of a DNN were fed into a DNN-HMM. Second, bottleneck features (BNF) extracted from a middle hidden layer were concatenated with the original MFCCs and fed into an SGMM-HMM. Experimental results showed that the features the DNN derived from the original acoustic features were more discriminative, and that the SGMM-HMM with BNF outperformed the DNN-HMM in detecting PETs, achieving the best detection results: a false rejection rate (FRR) of 5.3%, a false acceptance rate (FAR) of 29.6%, and a diagnostic accuracy (DA) of 90%.
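The sketch below illustrates the second step described above: taking the activations of a narrow middle hidden layer of a feed-forward DNN as bottleneck features and concatenating them with the original MFCC frames before passing the combined features to an SGMM-HMM back end. It is a minimal illustration, not the authors' implementation; the layer sizes, random weights, and 39-dimensional MFCC input are assumptions chosen only to make the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical topology: 39-dim MFCC input, a narrow 40-dim "bottleneck"
# hidden layer in the middle, and a phone-state output layer.
layer_sizes = [39, 1024, 40, 1024, 120]
weights = [rng.normal(scale=0.1, size=(m, n))
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]
BOTTLENECK_INDEX = 1  # position of the 40-dim layer in the weight stack

def extract_bnf(mfcc_frames):
    """Forward-propagate MFCC frames and return the bottleneck activations."""
    h = mfcc_frames
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = sigmoid(h @ w + b)
        if i == BOTTLENECK_INDEX:
            return h  # one 40-dim BNF vector per frame
    raise ValueError("bottleneck layer not reached")

# Toy usage: 100 frames of 39-dim MFCCs (random placeholders here).
mfcc = rng.normal(size=(100, 39))
bnf = extract_bnf(mfcc)

# Concatenate BNF with the original MFCCs; the combined 79-dim features
# would then be fed to the SGMM-HMM detector.
combined = np.concatenate([mfcc, bnf], axis=1)
print(combined.shape)  # (100, 79)
```

In practice the DNN would be trained as a phone-state classifier first, and the articulatory-feature posteriors from its top layer (the first step above) would be taken from the softmax output rather than a hidden layer.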
