State-dependent phoneme-based model merging for dialectal Chinese speech recognition

To build a dialectal Chinese speech recognizer from a standard Chinese speech recognizer using only a small amount of dialectal Chinese speech, a simple but effective acoustic modeling method, state-dependent phoneme-based model merging (SDPBMM), is proposed and evaluated. In SDPBMM, a tied state of standard triphones is merged with a state of the dialectal monophone that is identical to the central phoneme of those triphones. The method performs well, but it introduces a Gaussian mixture expansion problem. To address this, an acoustic model distance measure, the pseudo-divergence based distance measure, is proposed based on measuring the difference between Gaussian mixture models, and it is used to reduce the model size with almost no performance degradation for dialectal speech. With only 40 minutes of Shanghai-dialectal Chinese speech, SDPBMM achieves a significant absolute syllable error rate (SER) reduction of 5.9% for dialectal Chinese, with almost no performance degradation for standard Chinese. Combined with an existing adaptation method, a further absolute SER reduction of 1.9% is achieved.
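
The merging step and the resulting mixture expansion can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the merging formula, so the interpolation weight `lam`, the `Gaussian` class, and the function names are assumptions made here for illustration. The symmetric Kullback-Leibler divergence between diagonal Gaussians is shown only as a familiar stand-in for a model distance; it is not the pseudo-divergence based measure proposed in the paper.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Gaussian:
    """One diagonal-covariance mixture component (hypothetical container)."""
    weight: float
    mean: List[float]
    var: List[float]


def merge_states(standard: List[Gaussian], dialectal: List[Gaussian],
                 lam: float = 0.5) -> List[Gaussian]:
    """Pool the components of a standard tied state and a dialectal
    monophone state. `lam` is an assumed interpolation weight that keeps
    the merged mixture weights summing to one."""
    merged = [Gaussian(lam * g.weight, g.mean, g.var) for g in standard]
    merged += [Gaussian((1.0 - lam) * g.weight, g.mean, g.var) for g in dialectal]
    return merged


def kl_diag(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for two diagonal Gaussians (mixture weights ignored)."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, mq, vp, vq in zip(p.mean, q.mean, p.var, q.var)
    )


def sym_kl(p: Gaussian, q: Gaussian) -> float:
    """Symmetric KL divergence, a stand-in for a component distance."""
    return kl_diag(p, q) + kl_diag(q, p)


# Toy one-dimensional states: three standard and two dialectal components.
standard_state = [
    Gaussian(0.5, [0.0], [1.0]),
    Gaussian(0.3, [2.0], [1.5]),
    Gaussian(0.2, [-1.0], [0.5]),
]
dialectal_state = [
    Gaussian(0.6, [0.1], [1.0]),
    Gaussian(0.4, [3.0], [2.0]),
]

merged_state = merge_states(standard_state, dialectal_state, lam=0.7)
```

In this toy case the merged state holds five components where the standard state held three, which is the expansion the abstract refers to. A distance such as `sym_kl` could then flag near-duplicate components (here the first standard and first dialectal component) as candidates for reduction, under whatever criterion the actual method uses.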
