Effective articulatory modeling for pronunciation error detection of L2 learner without non-native training data

For effective articulatory feedback in computer-assisted pronunciation training (CAPT) systems, we address the articulatory modeling of second language (L2) learners' speech without using non-native training data, which is difficult to collect and annotate on a large scale. Context-dependent articulatory attributes (place and manner of articulation) are modeled with deep neural networks (DNNs). To train the non-native articulatory models efficiently, we exploit large speech corpora of the learners' native language and the target language to model inter-language phenomena. This multilingual learning is then combined with multi-task learning, which uses phone classification as a subtask. These methods are applied to Mandarin Chinese pronunciation learning by native speakers of Japanese. Their effectiveness is confirmed in both attribute classification of native speech and pronunciation error detection in non-native speech.
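The idea of a DNN whose shared layers feed separate articulatory-attribute outputs plus a phone-classification subtask can be illustrated with a small NumPy sketch. It is a hypothetical toy, not the authors' implementation. The layer sizes, the head names `place`, `manner`, and `phone`, the task weights, and the single shared ReLU layer are all assumptions. Context-dependent attribute labels, multilingual data pooling, and backpropagation training are omitted. The `detect_errors` helper is likewise a hypothetical decision rule: it flags a frame when the posterior of the canonical (expected) attribute falls below a threshold.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the reference class for each frame.
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


class MultiTaskDNN:
    """Hypothetical toy model: one shared hidden layer, one softmax head per task."""

    def __init__(self, n_in, n_hidden, head_sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        # Each task (e.g. place, manner, phone) gets its own output layer.
        self.heads = {
            name: (rng.normal(0.0, 0.1, (n_hidden, k)), np.zeros(k))
            for name, k in head_sizes.items()
        }

    def forward(self, x):
        # Shared representation used by every task.
        h = np.maximum(0.0, x @ self.W1 + self.b1)
        return {name: softmax(h @ W + b) for name, (W, b) in self.heads.items()}


def multitask_loss(outputs, targets, weights):
    # Weighted sum of per-task cross-entropies; the phone task acts as a subtask.
    return sum(weights[t] * cross_entropy(outputs[t], targets[t]) for t in targets)


def detect_errors(attr_post, canonical, threshold=0.5):
    # Hypothetical rule: flag frames whose canonical-attribute posterior is low.
    p = attr_post[np.arange(len(canonical)), canonical]
    return p < threshold


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    model = MultiTaskDNN(n_in=40, n_hidden=64,
                         head_sizes={"place": 10, "manner": 8, "phone": 30})
    x = rng.normal(size=(5, 40))  # 5 frames of made-up 40-dim acoustic features
    out = model.forward(x)
    targets = {
        "place": rng.integers(0, 10, size=5),
        "manner": rng.integers(0, 8, size=5),
        "phone": rng.integers(0, 30, size=5),
    }
    loss = multitask_loss(out, targets, {"place": 1.0, "manner": 1.0, "phone": 0.3})
    print(out["place"].shape, round(loss, 3))
    print(detect_errors(out["manner"], targets["manner"]))
```

In a multilingual setup along these lines, the training frames could come from both the native (Japanese) and target (Mandarin) corpora, with the phone head covering the phones of both languages. This is an assumption about how the pooling might be arranged, not a description of the paper's recipe.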
