State-dependent phonetic tied mixtures with pronunciation modeling for spontaneous speech recognition

We propose a method of incorporating pronunciation modeling into acoustic models with high discriminative power and low complexity to improve spontaneous speech recognition accuracy. Spontaneous speech contains more phonetic and acoustic confusions than read speech because pronunciation varies more widely with speaking rate, speaking style, speaking mode, speaker accent, and similar factors. Acoustic models built with general data-driven complexity-reduction methods, which do not model pronunciation variation explicitly, are not robust enough to capture the flexible phonetic confusions and pronunciation variants of spontaneous speech. We propose a state-dependent phonetic tied-mixture (PTM) model with variable codebook size to improve the coverage of phonetic variations while maintaining the discriminative ability of the model. Our state-dependent PTM model incorporates a state-level pronunciation model to better discriminate phonetic and acoustic confusions while reducing model complexity. Experimental results on the spontaneous speech part of Mandarin Broadcast News show that our model outperforms state-tying and mixture-tying models by 2.46% and 3.51% absolute syllable error rate reduction, respectively, with comparable model complexity. After Gaussian sharing is added to the latter two models, our proposed model still yields an additional 1% and 2.6% absolute syllable error rate reduction, respectively. In addition, unlike many complexity-reduction methods, our method does not lead to any performance degradation on read speech.
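As background for the tied-mixture idea mentioned above, the following is a minimal sketch of a generic tied-mixture emission score, in which several HMM states share one Gaussian codebook and differ only in their mixture weights. It is not the state-dependent PTM model of this paper: the variable codebook size and the state-level pronunciation model are not implemented, and the function names, the toy codebook, and the state labels `s1` and `s2` are all hypothetical, chosen only for illustration.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log density of x under each diagonal-covariance Gaussian in a codebook.

    means and variances have shape (K, D); the result has shape (K,).
    """
    diff = x - means
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + diff**2 / variances, axis=1)

def tied_mixture_loglik(x, means, variances, weights):
    """log b_j(x) = log sum_k w_jk N(x; mu_k, var_k) over a shared codebook.

    weights holds the state-specific mixture weights w_jk (shape (K,), all > 0).
    """
    log_comp = log_gauss_diag(x, means, variances) + np.log(weights)
    m = np.max(log_comp)  # log-sum-exp for numerical stability
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Hypothetical toy codebook of three 2-D Gaussians shared by two states.
means = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
variances = np.ones((3, 2))

# Each state keeps only its own weight vector; the Gaussians are tied.
state_weights = {
    "s1": np.array([0.7, 0.2, 0.1]),
    "s2": np.array([0.1, 0.3, 0.6]),
}

x = np.array([0.1, -0.2])  # one observation frame
scores = {s: tied_mixture_loglik(x, means, variances, w)
          for s, w in state_weights.items()}
print(scores)
```

Because the Gaussians are shared, adding a state costs only one extra weight vector, which is the source of the complexity reduction that tied-mixture approaches aim for. In this toy setup the frame lies near the first codebook Gaussian, so the state that weights that Gaussian most heavily scores higher.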
