Integrating multiple pronunciations during MCE-based acoustic model training for large vocabulary speech recognition

In this paper, we report on the implementation of an automatic method for discovering an appropriate pronunciation for each speech utterance of every speaker and integrating this new information into the minimum classification error (MCE) based training algorithm. The proposed method allows far more flexibility in accommodating multiple pronunciations than conventional supervised acoustic model training, where the phoneme sequence of a particular word is always fixed irrespective of speaker accents and pronunciation variations. Large vocabulary recognition results on the French SpeechDat-II speech corpus show consistent string error rate reductions of about 48% and 13% obtained by the proposed integrated method when compared to the MLE-trained and MCE-trained baseline systems, respectively.
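
The paper itself gives no pseudocode; as a rough illustration of the two ingredients the abstract names, the Python sketch below pairs a per-utterance pronunciation selection step (scoring each lexicon variant against the acoustics and keeping the best one) with the standard MCE sigmoid loss over a misclassification measure. The toy Gaussian scorer, the function names, and the example data are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

# Toy stand-in for forced-alignment scoring: log-likelihood of a feature
# sequence under one pronunciation variant, using a single spherical
# Gaussian per phoneme and frames split evenly across phonemes. A real
# system would Viterbi-align against the trained HMM set instead.
def align_score(features, pron, phone_means):
    score = 0.0
    for phone, seg in zip(pron, np.array_split(features, len(pron))):
        score += -0.5 * np.sum((seg - phone_means[phone]) ** 2)
    return score

# Per-utterance pronunciation selection: pick the lexicon variant that best
# explains this speaker's acoustics, rather than one fixed phoneme string.
def best_pronunciation(features, variants, phone_means):
    return max(variants, key=lambda p: align_score(features, p, phone_means))

# Standard MCE ingredients: misclassification measure d (correct-class score
# against a soft-max over competing-class scores) mapped through a sigmoid
# loss; the gradient of this loss drives the discriminative model updates.
def mce_loss(g_correct, g_competing, gamma=1.0, eta=1.0):
    g_anti = np.log(np.mean(np.exp(eta * np.asarray(g_competing)))) / eta
    d = -g_correct + g_anti
    return 1.0 / (1.0 + np.exp(-gamma * d))

# Tiny 1-D example: choose between two hypothetical variants of one word.
phone_means = {"ah": np.array([0.0]), "ey": np.array([2.0]), "t": np.array([1.0])}
rng = np.random.default_rng(0)
features = rng.normal(loc=2.0, scale=0.3, size=(10, 1))  # acoustics near "ey"
variants = [("ey", "t"), ("ah", "t")]
print(best_pronunciation(features, variants, phone_means))  # -> ('ey', 't')
```

In the integrated training loop the abstract describes, the pronunciation selected this way would stand in for the single fixed lexicon entry as that utterance's transcription before each discriminative MCE update.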
