Multi-level adaptive network for accented Mandarin speech recognition

Accented speech recognition is more challenging than standard speech recognition due to the acoustic and linguistic mismatch between standard and accented data. In this paper, we propose a new framework that combines a Tandem system, which improves the discriminative ability of the acoustic features, with a Multi-level Adaptive Network (MLAN), which incorporates information from a standard Mandarin corpus and alleviates the data sparseness problem. Mandarin spoken by Guangzhou speakers is treated as accented Mandarin (accented Putonghua, A-PTH), while Mandarin spoken by speakers from northern areas is treated as standard Mandarin (standard Putonghua, S-PTH). Significant relative character error rate reductions of 13.8% and 24.6% are obtained over baseline GMM-HMM systems trained on a mixed corpus containing both A-PTH and S-PTH data, and on the A-PTH corpus alone, respectively.
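As a rough illustration of the kind of two-level feature stacking an MLAN-style tandem front end uses, the minimal numpy sketch below wires together a first bottleneck extractor (standing in for an MLP trained on out-of-domain S-PTH data) and a second one (standing in for an MLP trained on in-domain A-PTH data), then concatenates the second-level bottleneck outputs with the acoustic features for GMM-HMM training. The feature dimensions, the toy one-layer extractors, and the random weights are placeholders for illustration only, not the authors' implementation.

```python
import numpy as np

def bottleneck_extract(features, weights):
    """Toy bottleneck extractor: one affine layer followed by tanh.
    In a real system this would be a trained MLP's bottleneck layer."""
    return np.tanh(features @ weights)

rng = np.random.default_rng(0)
n_frames, n_plp, n_bn = 100, 39, 26            # assumed dimensions

plp = rng.standard_normal((n_frames, n_plp))   # acoustic features (e.g. PLP/MFCC)

# Level 1: bottleneck features from a network trained on out-of-domain S-PTH data
# (weights are random placeholders here).
w_sph = rng.standard_normal((n_plp, n_bn))
bn_sph = bottleneck_extract(plp, w_sph)

# Level 2: a second network, trained on in-domain A-PTH data, takes the acoustic
# features concatenated with the level-1 bottleneck outputs.
level2_in = np.concatenate([plp, bn_sph], axis=1)
w_aph = rng.standard_normal((level2_in.shape[1], n_bn))
bn_aph = bottleneck_extract(level2_in, w_aph)

# Final tandem features for GMM-HMM training: acoustics + level-2 bottlenecks.
tandem = np.concatenate([plp, bn_aph], axis=1)
print(tandem.shape)   # (100, 65)
```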
