Tree-based state tying for high accuracy acoustic modelling

The key problem to be faced when building an HMM-based continuous speech recogniser is maintaining the balance between model complexity and available training data. For large vocabulary systems requiring cross-word context-dependent modelling, this problem is particularly acute, since many such contexts will never occur in the training data. This paper describes a method of creating a tied-state continuous speech recognition system using a phonetic decision tree. This tree-based clustering is shown to yield recognition performance similar to that obtained using an earlier data-driven approach, with the additional advantage of providing a mapping for unseen triphones. State tying is also compared with traditional model-based tying and shown to be clearly superior. Experimental results are presented for both the Resource Management and Wall Street Journal tasks.
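To make the idea of "a mapping for unseen triphones" concrete, the following is a minimal sketch of how a phonetic decision tree can assign any triphone to a tied state by asking yes/no questions about its left and right context. It is an illustration only, not the paper's implementation: the phone classes, the tree shape, the tied-state names, and the helper names (`Node`, `Leaf`, `tied_state_for`) are all hypothetical, and the tree here is written by hand rather than grown from training data. Triphones are written in the `l-c+r` form.

```python
from dataclasses import dataclass
from typing import Set, Union

# Hypothetical phone classes used as yes/no questions. A real system would
# use a much larger, linguistically motivated question set.
NASALS = {"m", "n", "ng"}
VOWELS = {"aa", "iy", "eh", "uw"}


@dataclass
class Leaf:
    tied_state: str  # identifier of the shared (tied) HMM state


@dataclass
class Node:
    side: str          # "left" or "right": which context the question tests
    phones: Set[str]   # the question: is that context phone in this set?
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def parse_triphone(name: str):
    """Split a triphone written as 'l-c+r' into (left, centre, right)."""
    left, rest = name.split("-", 1)
    centre, right = rest.split("+", 1)
    return left, centre, right


def tied_state_for(tree: Union[Node, Leaf], triphone: str) -> str:
    """Walk the tree from the root, answering each question, until a leaf."""
    left, _centre, right = parse_triphone(triphone)
    node = tree
    while isinstance(node, Node):
        context = left if node.side == "left" else right
        node = node.yes if context in node.phones else node.no
    return node.tied_state


# Toy, hand-written tree for one emitting state of centre phone "t".
# The leaf names are made up for this example.
TREE_T_STATE2 = Node(
    side="left",
    phones=NASALS,
    yes=Leaf("t_s2_a"),
    no=Node(side="right", phones=VOWELS, yes=Leaf("t_s2_b"), no=Leaf("t_s2_c")),
)

# Any context, including one never seen in training, reaches some leaf.
print(tied_state_for(TREE_T_STATE2, "n-t+iy"))
print(tied_state_for(TREE_T_STATE2, "s-t+s"))
```

Because every path through the tree ends in a leaf, a triphone that never occurred in the training data still receives a tied state, which is the property the abstract highlights over purely data-driven clustering.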

[1] Raj Reddy, et al. Automatic Speech Recognition: The Development of the Sphinx Recognition System, 1988.

[2] Michael Picheny, et al. Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees, 1991, HLT.

[3] Steve Young, et al. The general use of tying in phoneme-based HMM speech recognisers, 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4] Steve Young, et al. The HTK hidden Markov model toolkit: design and philosophy, 1993.

[5] Steve J. Young, et al. The use of state tying in continuous speech recognition, 1993, EUROSPEECH.

[6] Mari Ostendorf, et al. Maximum likelihood clustering of Gaussians for speech recognition, 1994, IEEE Trans. Speech Audio Process.

[7] Steve J. Young, et al. A One Pass Decoder Design For Large Vocabulary Recognition, 1994, HLT.

[8] Steve J. Young, et al. Large vocabulary continuous speech recognition using HTK, 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[9] Mei-Yuh Hwang, et al. Predicting unseen triphones with senones, 1996, IEEE Trans. Speech Audio Process.