Speech recognition based on statistical models including multiple phonetic decision trees

We propose a speech recognition technique that uses multiple model structures. When context-dependent models are used, decision-tree-based context clustering is applied to find an appropriate parameter-tying structure. However, context clustering is usually performed on the basis of unreliable statistics of hidden Markov model (HMM) state sequences, because estimating reliable state sequences requires an appropriate model structure, which cannot be obtained prior to context clustering. Context clustering and the estimation of state sequences are therefore interdependent and cannot be performed separately. To overcome this problem, we propose a technique for optimizing state sequences based on an annealing process that uses multiple decision trees. In this technique, a new likelihood function is defined to handle multiple model structures, and the deterministic annealing expectation-maximization (DAEM) algorithm is used for training. In continuous phoneme recognition experiments, the proposed method, using only two decision trees, achieved a relative error reduction of about 11.1% over the conventional method.
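
The annealing idea behind the deterministic annealing EM algorithm can be illustrated on a much simpler model than the one proposed here. The sketch below is not the paper's method: it assumes a toy one-dimensional mixture of two Gaussians with equal weights and a known shared variance, and estimates only the two means. The function names (`tempered_posteriors`, `daem_means`), the temperature schedule `betas`, and the data are all hypothetical choices made for illustration. The only difference from ordinary EM is that the E-step posteriors are raised to an inverse temperature `beta` before being renormalized, and `beta` is increased gradually to 1.

```python
import math


def tempered_posteriors(x, means, variance, beta):
    """E-step for one sample: posteriors proportional to p(x, k) ** beta."""
    # Log joint up to a constant shared by both components
    # (equal weights and equal variance are assumed, so they cancel).
    log_joint = [-0.5 * (x - m) ** 2 / variance for m in means]
    scaled = [beta * lj for lj in log_joint]
    top = max(scaled)  # subtract the maximum for numerical stability
    unnorm = [math.exp(s - top) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def daem_means(data, variance=1.0, betas=(0.2, 0.4, 0.6, 0.8, 1.0),
               iters_per_beta=30):
    """Estimate two component means while annealing beta up to 1."""
    center = sum(data) / len(data)
    # Deliberately poor start: both means almost on top of each other.
    means = [center - 1e-3, center + 1e-3]
    for beta in betas:
        for _ in range(iters_per_beta):
            resp = [tempered_posteriors(x, means, variance, beta)
                    for x in data]
            # M-step: responsibility-weighted average of the data.
            for k in range(len(means)):
                mass = sum(r[k] for r in resp)
                means[k] = sum(r[k] * x for r, x in zip(resp, data)) / mass
    return means


# Hypothetical data: two well-separated groups centred near -3 and +3.
data = [-3.5, -3.0, -2.5, 2.5, 3.0, 3.5]
means = sorted(daem_means(data))
print(means)
```

At `beta = 1` the update reduces to standard EM for this toy model. In the setting of the abstract, the latent variables would be HMM state sequences and the likelihood would cover multiple decision trees, neither of which this sketch attempts to model.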
