Pronunciation Modeling for Spontaneous Mandarin Speech Recognition

Pronunciation variations in spontaneous speech can be classified into complete changes and partial changes. A complete change is the replacement of a canonical phoneme by an alternative phone, such as 'b' being pronounced as 'p'. Partial changes are variations within the phoneme, such as nasalization, centralization and voicing. Most current work in pronunciation modeling for spontaneous Mandarin speech remains at the phone level and can model only complete changes, not partial changes. In this paper, we show that partial changes are much less clear-cut than previously assumed and cannot be modeled merely by representing them with alternate phone units. We present a solution for modeling both complete changes and partial changes in spontaneous Mandarin speech. To model complete changes, we adapted decision tree-based pronunciation modeling from English to Mandarin to predict alternate pronunciations. To address data sparseness, we used cross-domain data to estimate pronunciation variability. To discard unreliable alternative pronunciations, we proposed a likelihood ratio test as a confidence measure of the degree of phonetic confusion. To model partial changes, we proposed partial change phone models (PCPMs) with acoustic model reconstruction. PCPMs are regarded as extensions of the standard phoneme or initial/final subword units, and can represent partial changes efficiently. To avoid model confusion, we generated auxiliary decision trees for PCPM triphones and used decision tree merging to perform acoustic model reconstruction. The effectiveness of these approaches was evaluated on the 1997 Hub4NE Mandarin Broadcast News corpus, which contains different styles of speech.
Our phone-level pronunciation modeling provided an absolute 0.9% reduction in syllable error rate. The acoustic model reconstruction approach covered pronunciation variations more efficiently than phone-level modeling, yielding a significant 2.39% absolute reduction in syllable error rate for spontaneous speech. In addition, our proposed method deals with partial changes at the acoustic model level and can be applied to any automatic speech recognition system based on subword units.
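
The abstract does not give the form of the likelihood ratio test, so the following is only an illustrative sketch of how such a confidence measure could prune alternative pronunciations. It assumes a one-sided binomial test of each observed substitution rate against a background noise rate; the function names, the noise rate `p0`, the cutoff `threshold`, and the toy alignment data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def llr_score(k, n, p0):
    """One-sided binomial log-likelihood ratio statistic, 2*log(L(p_hat)/L(p0)).

    k  = times the canonical phone surfaced as the alternative phone
    n  = total occurrences of the canonical phone
    p0 = assumed background (noise) substitution rate, hypothetical
    """
    if n == 0 or k == 0:
        return 0.0
    p_hat = k / n
    if p_hat <= p0:
        # The observed rate does not exceed the noise rate: no evidence.
        return 0.0
    score = k * math.log(p_hat / p0)
    if k < n:
        score += (n - k) * math.log((1.0 - p_hat) / (1.0 - p0))
    return 2.0 * score


def reliable_variants(aligned_pairs, p0=0.02, threshold=3.84):
    """Keep (canonical, surface) substitutions whose score exceeds the cutoff.

    aligned_pairs is a list of (canonical_phone, surface_phone) tuples, e.g.
    from aligning a canonical transcription with a phonetic transcription.
    The default cutoff borrows the 5% critical value of a chi-square
    distribution with one degree of freedom; it is an assumed setting.
    """
    pair_counts = Counter(aligned_pairs)
    base_counts = Counter(base for base, _ in aligned_pairs)
    kept = {}
    for (base, surface), k in pair_counts.items():
        if base == surface:
            continue  # canonical realization, not a pronunciation change
        n = base_counts[base]
        if llr_score(k, n, p0) > threshold:
            kept[(base, surface)] = k / n  # estimated P(surface | base)
    return kept


# Toy alignment counts (invented for illustration only).
toy_pairs = (
    [("b", "b")] * 80 + [("b", "p")] * 20   # frequent, well supported
    + [("d", "d")] * 99 + [("d", "t")] * 1  # rate below the noise level
    + [("g", "g")] * 3 + [("g", "k")] * 1   # high rate, but too few samples
)
variants = reliable_variants(toy_pairs)
```

On this toy data only the 'b' to 'p' substitution is kept. The 'g' to 'k' substitution has a higher observed rate, but four samples are too few to clear the cutoff, which is the kind of sparsely observed alternative a confidence measure is meant to discard.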
