Enhanced tree clustering with single pronunciation dictionary for conversational speech recognition

Modeling pronunciation variation is key to recognizing conversational speech. Rather than confining pronunciation modeling to the dictionary, we argue that triphone clustering is an integral part of it. We propose a new approach called enhanced tree clustering. This approach, in contrast to traditional decision-tree-based state tying, allows parameter sharing across phonemes. We show that accurate pronunciation modeling can be achieved through efficient parameter sharing in the acoustic model. Combined with a single-pronunciation dictionary, this approach yields a 1.8% absolute word error rate improvement on Switchboard, a large-vocabulary conversational speech recognition task.
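The key mechanism above is that a single decision tree is grown over states of all phonemes, so a question set that can also ask about the center phone may tie states across phoneme boundaries. A minimal sketch of such greedy likelihood-based tree clustering is below; the single-Gaussian state statistics, question set, and stopping threshold are toy assumptions for illustration, not the paper's actual configuration.

```python
import math

# Toy representation of a context-dependent HMM state:
# ((center, left, right), (count, mean, var)) with scalar single-Gaussian stats.
# Because every state enters ONE shared tree, questions may test the center
# phone too, so a leaf can tie states of different phonemes.

def leaf_loglik(states):
    """Log-likelihood of pooling the states into one tied Gaussian
    (the standard ML criterion used in decision-tree state tying)."""
    n = sum(c for _, (c, m, v) in states)
    mean = sum(c * m for _, (c, m, v) in states) / n
    var = sum(c * (v + (m - mean) ** 2) for _, (c, m, v) in states) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(states, questions):
    """Return (gain, name, yes, no) for the best question, or None."""
    base = leaf_loglik(states)
    best = None
    for name, q in questions:
        yes = [s for s in states if q(s)]
        no = [s for s in states if not q(s)]
        if not yes or not no:
            continue  # question does not partition this node
        gain = leaf_loglik(yes) + leaf_loglik(no) - base
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best

def grow(states, questions, min_gain=1.0):
    """Recursively split; return the list of leaves (tied state clusters)."""
    split = best_split(states, questions)
    if split is None or split[0] < min_gain:
        return [states]  # one tied leaf
    _, _, yes, no = split
    return grow(yes, questions, min_gain) + grow(no, questions, min_gain)

# Hypothetical example: acoustically similar /t/ and /d/ states land in the
# same leaf, i.e. they share parameters across phonemes.
states = [
    (("t", "a", "a"), (10, 0.0, 1.0)),
    (("d", "a", "a"), (10, 0.1, 1.0)),
    (("s", "i", "i"), (10, 5.0, 1.0)),
]
questions = [
    ("center-is-stop", lambda s: s[0][0] in {"t", "d"}),
    ("left-is-a",      lambda s: s[0][1] == "a"),
]
leaves = grow(states, questions)
```

With these toy statistics, the tree's first split separates the stop-like states from the /s/ state, and no further question yields a gain, leaving the /t/ and /d/ states tied in one leaf even though they belong to different phonemes.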
