Phonological feature-based speech recognition system for pronunciation training in non-native language learning.

The authors address the question of whether phonological features can be used effectively in an automatic speech recognition (ASR) system for pronunciation training in non-native language (L2) learning. Computer-aided pronunciation training consists of two essential tasks: detecting mispronunciations and providing corrective feedback, usually on the basis of either full words or phonemes. Phonemes, however, can be further decomposed into phonological features, which in turn define groups of phonemes. A phonological feature-based ASR system allows the authors to perform a sub-phonemic analysis at the feature level, providing more effective feedback for reaching the acoustic goal and perceptual constancy. Furthermore, phonological features provide a structured way of analysing the types of errors a learner makes, and can readily convey which pronunciations need improvement. This paper presents the authors' implementation of such an ASR system, using deep neural networks as the acoustic model, and its use for detecting mispronunciations, analysing errors, and rendering corrective feedback. Quantitative as well as qualitative evaluations are carried out for German and Italian learners of English. In addition to achieving high accuracy in mispronunciation detection, the system also provides accurate diagnosis of errors.
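To make the idea of sub-phonemic diagnosis concrete, the following is a minimal sketch of how a target phoneme and the phoneme a learner actually produced could be compared feature by feature. It is not the authors' system: the feature table `FEATURES`, its three binary features, the ARPAbet-style labels, and the helper names `diagnose` and `feedback` are all simplified assumptions chosen for illustration, and the recognition step that would supply the produced phoneme is omitted.

```python
# Hypothetical, simplified feature table (an assumption, not the paper's inventory).
# Keys are ARPAbet-style labels: "th" as in "think", "dh" as in "this".
FEATURES = {
    "th": {"voiced": False, "continuant": True,  "strident": False},
    "dh": {"voiced": True,  "continuant": True,  "strident": False},
    "s":  {"voiced": False, "continuant": True,  "strident": True},
    "z":  {"voiced": True,  "continuant": True,  "strident": True},
    "t":  {"voiced": False, "continuant": False, "strident": False},
    "d":  {"voiced": True,  "continuant": False, "strident": False},
}


def diagnose(target, produced):
    """Return the features that differ, mapped to (expected, observed) values."""
    expected = FEATURES[target]
    observed = FEATURES[produced]
    return {
        name: (expected[name], observed[name])
        for name in expected
        if expected[name] != observed[name]
    }


def feedback(target, produced):
    """Turn the feature-level differences into a short corrective message."""
    diffs = diagnose(target, produced)
    if not diffs:
        return f"/{target}/ matches the target on all listed features."
    parts = [
        f"{name} should be {'+' if want else '-'} but was {'+' if got else '-'}"
        for name, (want, got) in diffs.items()
    ]
    return f"/{target}/ realised as /{produced}/: " + "; ".join(parts)


if __name__ == "__main__":
    # A target "th" realised as "s" differs only in stridency.
    print(feedback("th", "s"))
    # A target "th" realised as "t" differs only in continuancy.
    print(feedback("th", "t"))
```

In this sketch a single substituted phoneme is reduced to the one or two features that went wrong, which is the kind of structured error description the abstract refers to; a phoneme-level system would only report that the two labels differ.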
