Decision tree-based simultaneous clustering of phonetic contexts, dimensions, and state positions for acoustic modeling

In recent years, context-dependent hidden Markov models (HMMs), typically triphones with continuous-density output distributions, have been widely used for acoustic modeling. Triphones introduce too many free parameters into a system, which makes it difficult to estimate statistically reliable models. Various parameter clustering techniques have therefore been proposed, and state tying based on Phonetic Decision Trees (P-DT) is a good solution to this problem. However, state tying cannot construct an appropriate context-dependent sharing structure for each dimension of the feature vector, nor can it assign an appropriate number of free parameters to each dimension. In this paper, Phonetic and Dimensional Decision Trees (PD-DT) are proposed by introducing an MDL-based dimensional-split technique into P-DT. Furthermore, by incorporating questions about state positions into PD-DT, Phonetic, Dimensional and State-positional Decision Trees (PDS-DT) are defined. In speaker-independent continuous speech recognition experiments, the proposed technique achieved about 13%–15% error reduction over the P-DT based state-tying technique.
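The idea that each dimension may need its own sharing structure can be illustrated with a minimal sketch. It is not the formulation used in the paper: it assumes a single one-dimensional Gaussian per tree node, a penalty of (extra parameters / 2) * log(total sample count), and hypothetical function names and toy data. Under those assumptions, a phonetic question is accepted for a dimension only if it lowers the description length of that dimension.

```python
import math

def gaussian_neg_loglik(values):
    """Negative log-likelihood of 1-D values under their own ML Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    var = max(var, 1e-8)  # variance floor to avoid log(0)
    return 0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def mdl_split_change(yes, no, total_count):
    """Change in description length when a node is split into yes/no children.

    A negative value means the split shortens the description and is accepted.
    Assumption: the split adds one Gaussian, i.e. 2 parameters (mean, variance).
    """
    parent = gaussian_neg_loglik(yes + no)
    children = gaussian_neg_loglik(yes) + gaussian_neg_loglik(no)
    penalty = 0.5 * 2 * math.log(total_count)
    return (children - parent) + penalty

# Toy data (hypothetical): the same phonetic question applied to two dimensions.
# In dimension A the two contexts differ strongly; in dimension B they do not.
dim_a_yes, dim_a_no = [0.0, 0.1, -0.1, 0.05], [5.0, 5.1, 4.9, 5.05]
dim_b_yes, dim_b_no = [0.0, 0.1, -0.1, 0.05], [0.02, -0.05, 0.08, -0.02]

split_a = mdl_split_change(dim_a_yes, dim_a_no, total_count=8) < 0
split_b = mdl_split_change(dim_b_yes, dim_b_no, total_count=8) < 0
print("split dimension A:", split_a)
print("split dimension B:", split_b)
```

In this toy case the question is accepted for dimension A and rejected for dimension B, so the two dimensions end up with different numbers of parameters, which a tree shared across all dimensions cannot express.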
