Paper information - Predicting unseen triphones with senones

In large-vocabulary speech recognition, we often encounter triphones that are not covered in the training data. These unseen triphones are usually backed off to their corresponding diphones or context-independent phones, which contain less context yet have plenty of training examples. We propose to use decision-tree-based senones to generate needed senonic baseforms for these unseen triphones. A decision tree is built for each Markov state of each base phone; the leaves of the trees constitute the senone pool. To find the senone associated with a Markov state of any triphone, the corresponding tree is traversed until a leaf node is reached. The effectiveness of the proposed approach was demonstrated in the ARPA 5000-word speaker-independent Wall Street Journal dictation task. The word error rate was reduced by 11% when unseen triphones were modeled by the decision-tree-based senones instead of context-independent phones. When there were more than five unseen triphones in each test utterance, the error rate reduction was more than 20%.
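The lookup described in the abstract, one decision tree per Markov state of each base phone with senones at the leaves, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Node` layout, the `find_senone` and `senonic_baseform` helpers, the phone classes used as questions, the three-state topology, and the toy trees for the phone `k`.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass
class Node:
    """One node of a (hypothetical) per-state decision tree."""
    senone: Optional[int] = None          # set only on leaves
    side: Optional[str] = None            # "left" or "right" context is tested
    phones: FrozenSet[str] = frozenset()  # question: is the context in this set?
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def find_senone(root: Node, left: str, right: str) -> int:
    """Traverse the tree until a leaf is reached and return its senone id."""
    node = root
    while node.senone is None:
        context = left if node.side == "left" else right
        node = node.yes if context in node.phones else node.no
    return node.senone


def senonic_baseform(
    trees: Dict[Tuple[str, int], Node],
    left: str,
    base: str,
    right: str,
    n_states: int,
) -> List[int]:
    """Look up one senone per Markov state of the triphone left-base+right."""
    return [find_senone(trees[(base, s)], left, right) for s in range(n_states)]


# Toy question sets and trees (illustrative assumptions, not real data).
NASALS = frozenset({"m", "n", "ng"})
VOWELS = frozenset({"aa", "ae", "iy", "uw"})

toy_trees = {
    ("k", 0): Node(side="left", phones=NASALS,
                   yes=Node(senone=0), no=Node(senone=1)),
    ("k", 1): Node(senone=2),  # a tree that was never split
    ("k", 2): Node(side="right", phones=VOWELS,
                   yes=Node(senone=3), no=Node(senone=4)),
}

# A triphone that need not have appeared in training still gets a baseform.
baseform = senonic_baseform(toy_trees, "ng", "k", "iy", 3)
print(baseform)  # [0, 2, 3]
```

Because the questions test only the identity of the neighboring phones, any triphone of a known base phone reaches a leaf in every tree, which is why no back-off to a context-independent model is required in this sketch.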

Mei-Yuh Hwang | Xuedong Huang | Fil Alleva
