Using prosody to improve Mandarin automatic speech recognition

This paper discusses how to model and train prosody-dependent acoustic models for Mandarin and how to decode input speech with a prosody-dependent speech recognition system. We use automatic prosody labeling methods to annotate the prosodic break type and stress type of each syllable in a continuous speech corpus, and we apply our proposed methods to train prosody-dependent tonal syllable models, addressing the data sparsity that arises after prosody labeling. We also use MSD-HSMM to model prosodic factors such as pitch and duration, and for decoding we combine the MSD-HSMM model, a GMM-based prosody-dependent tonal syllable duration model, and a Maximum Entropy-based syntactic prosody model. Compared with the baseline system, our prosody-dependent speech recognition systems significantly improve the tonal syllable correct rate.
