A Frame-Based Context-Dependent Acoustic Modeling for Speech Recognition

We propose a novel acoustic model for speech recognition, the FCD (Frame-based Context-Dependent) model. It obtains its probability distributions through a top-down clustering technique that simultaneously considers the local frame position within a phoneme, the phoneme duration, and the phoneme context. The model topology is derived by connecting left-to-right HMMs without self-loop transitions, one for each phoneme duration. Because the FCD model can change the probability distribution over the sequence of frames corresponding to one phoneme duration, it is able to generate a smooth trajectory of speech feature vectors. We also performed an experiment to evaluate the speech recognition performance of the model. In the experiment, 132 questions for frame position, 66 questions for phoneme duration, and 134 questions for phoneme context were used to train the sub-phoneme FCD model. For comparison, a left-to-right HMM and two types of HSMM with almost the same number of states were also trained. As a result, the FCD model achieved an 18% relative improvement in tri-phone accuracy.
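The topology described above can be illustrated with a minimal sketch. It assumes one left-to-right chain of d states for each duration d, no self-loops, and an entry probability per chain equal to the probability of that duration. The function name `build_fcd_topology`, the `(duration, frame_position)` state labels, the `"EXIT"` marker, and the example duration probabilities are all hypothetical and are not taken from the paper.

```python
def build_fcd_topology(duration_probs):
    """Build a duration-indexed, self-loop-free topology for one phoneme.

    duration_probs[d - 1] is the assumed probability that the phoneme
    lasts exactly d frames (illustrative values, not from the paper).
    """
    if abs(sum(duration_probs) - 1.0) > 1e-9:
        raise ValueError("duration probabilities must sum to 1")

    states = []       # (duration, frame_position) pairs
    entry = {}        # probability of entering each chain at its first state
    transitions = {}  # state -> {next_state: probability}

    for d, p in enumerate(duration_probs, start=1):
        entry[(d, 0)] = p
        for t in range(d):
            states.append((d, t))
            # Each state advances deterministically: no self-loop, so a
            # chain of d states always consumes exactly d frames.
            nxt = (d, t + 1) if t + 1 < d else "EXIT"
            transitions[(d, t)] = {nxt: 1.0}

    return states, entry, transitions


# Hypothetical phoneme lasting 1, 2, or 3 frames.
states, entry, transitions = build_fcd_topology([0.2, 0.5, 0.3])
```

Under these assumptions every state is identified by its duration and frame position, which is the kind of index the clustering questions on frame position and phoneme duration would operate on. With a maximum duration of D frames, the untied topology has D(D+1)/2 states per phoneme context, which is why state sharing through clustering matters.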
