Transparent pronunciation scoring using articulatorily weighted phoneme edit distance

To study the effects of gamification on children's foreign language learning in the "Say It Again, Kid!" project, we developed a feedback paradigm that can drive gameplay in pronunciation learning games. We describe our scoring system, which is based on the difference between a reference phone sequence and the output of a multilingual CTC phoneme recogniser. We present a white-box scoring model that uses a mapped, weighted Levenshtein edit distance between the reference and the recognised sequence, with error weights for articulatory differences computed from a training set of scored utterances. The system can produce a human-readable list of each detected mispronunciation's contribution to the utterance score. We compare our scoring method to established black-box methods.
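As an illustration of the general idea only, a weighted phoneme edit distance with a per-error breakdown can be sketched as follows. This is a minimal sketch, not the system described here: the feature table `FEATURES`, the feature weights `WEIGHTS`, the insertion and deletion costs, and the function names are all hypothetical placeholders, whereas the described system computes its error weights from a training set of scored utterances.

```python
from typing import Dict, List, Tuple

# Hypothetical articulatory features (place, manner, voicing) for a few phones.
FEATURES: Dict[str, Tuple[str, str, str]] = {
    "p": ("bilabial", "plosive", "voiceless"),
    "b": ("bilabial", "plosive", "voiced"),
    "t": ("alveolar", "plosive", "voiceless"),
    "d": ("alveolar", "plosive", "voiced"),
    "s": ("alveolar", "fricative", "voiceless"),
    "z": ("alveolar", "fricative", "voiced"),
}
# Hypothetical weights per feature (place, manner, voicing); assumed, not trained.
WEIGHTS: Tuple[float, float, float] = (0.5, 0.3, 0.2)
INS_COST = 1.0  # assumed cost of an inserted phone
DEL_COST = 1.0  # assumed cost of a deleted phone
EPS = 1e-9


def sub_cost(ref: str, hyp: str) -> float:
    """Substitution cost: summed weights of the features that differ."""
    if ref == hyp:
        return 0.0
    if ref not in FEATURES or hyp not in FEATURES:
        return 1.0  # unknown phone: fall back to a full-cost substitution
    return sum(w for w, a, b in zip(WEIGHTS, FEATURES[ref], FEATURES[hyp]) if a != b)


def score(ref: List[str], hyp: List[str]) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the weighted edit distance and a readable list of error costs."""
    n, m = len(ref), len(hyp)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + DEL_COST
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + INS_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]),
                dist[i - 1][j] + DEL_COST,
                dist[i][j - 1] + INS_COST,
            )
    # Backtrace one optimal alignment to attribute the total cost to single errors.
    report: List[Tuple[str, float]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = sub_cost(ref[i - 1], hyp[j - 1])
            if abs(dist[i][j] - (dist[i - 1][j - 1] + cost)) < EPS:
                if cost > 0:
                    report.append((f"{ref[i - 1]} -> {hyp[j - 1]}", cost))
                i, j = i - 1, j - 1
                continue
        if i > 0 and abs(dist[i][j] - (dist[i - 1][j] + DEL_COST)) < EPS:
            report.append((f"{ref[i - 1]} deleted", DEL_COST))
            i -= 1
        else:
            report.append((f"{hyp[j - 1]} inserted", INS_COST))
            j -= 1
    report.reverse()
    return dist[n][m], report
```

Calling `score(["b", "a", "t"], ["p", "a", "d"])` under these assumed weights yields a total cost together with one entry per detected substitution, which is the kind of itemised output a white-box score can expose to the learner or to game logic.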
