Decision tree-based triphones are robust and practical for Mandarin speech recognition

In large-vocabulary, speaker-independent speech recognition systems, vocabulary words must be modeled by subword units. This paper studies the use of triphone units for Mandarin speech recognition and compares them with biphone and context-independent phonetic units. To handle triphones that are unseen in the training data, decision-tree-based clustering is applied to the triphone units. This method achieves high recognition performance with limited training data and also reduces model training time. The robustness and effectiveness of the cross-word, tree-based triphone units are demonstrated on a speaker-independent continuous Mandarin speech recognition task. State tying reduces the training computation time of the triphone models by a factor of about 2.3, and syllable recognition accuracy increases by 28.7% compared to monophone units and by 13.5% compared to biphone units.
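As an illustration of how decision-tree-based clustering can assign an unseen triphone to a tied model, the following is a minimal sketch and not the system described in this paper. It assumes one-dimensional features, a single Gaussian per cluster, a greedy likelihood-gain splitting criterion, and two hypothetical phonetic questions. The names `Stats`, `build_tree`, and `find_leaf`, the toy data, and the thresholds `min_gain` and `min_occ` are all invented for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics of a 1-D Gaussian (toy assumption)."""
    n: float   # occupancy count
    s: float   # sum of observations
    ss: float  # sum of squared observations

    def __add__(self, other):
        return Stats(self.n + other.n, self.s + other.s, self.ss + other.ss)


def stats_from(samples):
    return Stats(len(samples), sum(samples), sum(x * x for x in samples))


def loglik(st, var_floor=1e-3):
    """Log-likelihood of the pooled data under its own ML Gaussian."""
    if st.n <= 0:
        return 0.0
    mean = st.s / st.n
    var = max(st.ss / st.n - mean * mean, var_floor)
    return -0.5 * st.n * (math.log(2 * math.pi * var) + 1.0)


def pool(items):
    total = Stats(0.0, 0.0, 0.0)
    for _, st in items:
        total = total + st
    return total


def answers(question, triphone):
    """A question is (side, phone_set); a triphone is (left, center, right)."""
    side, phones = question
    left, _, right = triphone
    return (left if side == "L" else right) in phones


def build_tree(items, questions, min_gain=1.0, min_occ=5.0):
    """Greedily split on the question with the largest likelihood gain."""
    parent = loglik(pool(items))
    best = None
    for q in questions:
        yes = [it for it in items if answers(q, it[0])]
        no = [it for it in items if not answers(q, it[0])]
        if not yes or not no:
            continue
        py, pn = pool(yes), pool(no)
        if py.n < min_occ or pn.n < min_occ:
            continue
        gain = loglik(py) + loglik(pn) - parent
        if best is None or gain > best[0]:
            best = (gain, q, yes, no)
    if best is None or best[0] < min_gain:
        # Leaf: all triphones here share one tied model.
        return {"leaf": sorted(t for t, _ in items)}
    _, q, yes, no = best
    return {
        "question": q,
        "yes": build_tree(yes, questions, min_gain, min_occ),
        "no": build_tree(no, questions, min_gain, min_occ),
    }


def find_leaf(tree, triphone):
    """Answer the questions from the root; works for unseen triphones too."""
    while "leaf" not in tree:
        branch = "yes" if answers(tree["question"], triphone) else "no"
        tree = tree[branch]
    return tree["leaf"]


# Hypothetical questions: right context is a nasal; left context is a labial.
QUESTIONS = [("R", frozenset({"n", "ng"})), ("L", frozenset({"b", "p", "m"}))]

# Toy training statistics for triphones with center phone "a" (invented data).
ITEMS = [
    (("b", "a", "n"), stats_from([2.0, 2.1, 1.9, 2.2, 1.8, 2.0])),
    (("d", "a", "ng"), stats_from([2.1, 1.9, 2.0, 2.2, 1.8, 2.0])),
    (("b", "a", "i"), stats_from([0.0, 0.1, -0.1, 0.2, -0.2, 0.0])),
    (("d", "a", "o"), stats_from([0.1, -0.1, 0.0, 0.2, -0.2, 0.0])),
]

tree = build_tree(ITEMS, QUESTIONS)
# ("m", "a", "n") never occurs in ITEMS, yet it still reaches a leaf.
print(find_leaf(tree, ("m", "a", "n")))
```

In this toy setup the unseen triphone `("m", "a", "n")` is routed by the right-context nasal question to the leaf shared by the seen nasal-context triphones, so it can reuse their tied parameters. A real system would cluster HMM states with multivariate features and a much larger question set.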
