Deep segmental phonetic posterior-grams based discovery of non-categories in L2 English speech

Second language (L2) speech is often labeled with the native, phone categories. However, in many cases, it is difficult to decide on a categorical phone that an L2 segment belongs to. These segments are regarded as non-categories. Most existing approaches for Mispronunciation Detection and Diagnosis (MDD) are only concerned with categorical errors, i.e. a phone category is inserted, deleted or substituted by another. However, non-categorical errors are not considered. To model these non-categorical errors, this work aims at exploring non-categorical patterns to extend the categorical phone set. We apply a phonetic segment classifier to generate segmental phonetic posterior-grams (SPPGs) to represent phone segment-level information. And then we explore the non-categories by looking for the SPPGs with more than one peak. Compared with the baseline system, this approach explores more non-categorical patterns, and also perceptual experimental results show that the explored non-categories are more accurate with increased confusion degree by 7.3% and 7.5% under two different measures. Finally, we preliminarily analyze the reason behind those non-categories.

[1]  Nikhil Ketkar,et al.  Introduction to PyTorch , 2021, Deep Learning with Python.

[2]  Shuang Zhang,et al.  Automatic derivation of phonological rules for mispronunciation detection in a computer-assisted pronunciation training system , 2010, INTERSPEECH.

[3]  E. Vajda Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet , 2000 .

[4]  Alissa M. Harrison,et al.  Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English : The CUHK Experience Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English : The CUHK Experience , 2010 .

[5]  Chong Wang,et al.  Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[6]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[7]  George Trigeorgis,et al.  Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8]  Steve J. Young,et al.  Phone-level pronunciation scoring and assessment for interactive language learning , 2000, Speech Commun..

[9]  Frank K. Soong,et al.  A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL) , 2013, INTERSPEECH.

[10]  Yoon Kim,et al.  Automatic pronunciation scoring for language instruction , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[11]  James R. Glass,et al.  Personalized mispronunciation detection and diagnosis based on unsupervised error pattern discovery , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[13]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[14]  Andreas Stolcke,et al.  The Microsoft 2017 Conversational Speech Recognition System , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Tara N. Sainath,et al.  Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16]  Kun Li,et al.  Unsupervised Discovery of an Extended Phoneme Set in L2 English Speech for Mispronunciation Detection and Diagnosis , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[17]  Hao Wang,et al.  Phonetic posteriorgrams for many-to-one voice conversion without parallel data training , 2016, 2016 IEEE International Conference on Multimedia and Expo (ICME).

[18]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[19]  Frank K. Soong,et al.  Generalized Segment Posterior Probability for Automatic Mandarin Pronunciation Evaluation , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[20]  Kun Li,et al.  Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[21]  R. Lado,et al.  Linguistics Across Cultures: Applied Linguistics for Language Teachers , 1957 .

[22]  E. Zee Chinese (Hong Kong Cantonese) , 1991, Journal of the International Phonetic Association.

[23]  James R. Glass,et al.  Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24]  Lin-Shan Lee,et al.  Supervised Detection and Unsupervised Discovery of Pronunciation Error Patterns for Computer-Assisted Language Learning , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[25]  Frank K. Soong,et al.  Discriminative acoustic model for improving mispronunciation detection and diagnosis in computer-aided pronunciation training (CAPT) , 2010, INTERSPEECH.

[26]  Lin-Shan Lee,et al.  Toward unsupervised discovery of pronunciation error patterns using universal phoneme posteriorgram for computer-assisted language learning , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[27]  Yuen Yee Lo,et al.  Deriving salient learners’ mispronunciations from cross-language phonological comparisons , 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[28]  Wai Kit Lo,et al.  Implementation of an extended recognition network for mispronunciation detection and diagnosis in computer-assisted pronunciation training , 2009, SLaTE.