Acoustic modeling with contextual additive structure for HMM-based speech recognition

This paper proposes an acoustic modeling technique based on an additive structure of context dependencies for HMM-based speech recognition. Typical context-dependent models, e.g., triphone HMMs, model phonetic context dependencies directly: once a phonetic context is given, the Gaussian distribution is determined immediately. This paper assumes a more complex structure, in which acoustic feature components with different context dependencies are combined additively. Since the output probability distribution is composed of additive component distributions, a large number of distinct distributions can be represented efficiently by combining a small set of component distributions. To extract additive components automatically, this paper presents a context clustering algorithm for the additive-structure model in which multiple decision trees are constructed simultaneously. Experimental results show that the proposed technique improves phoneme recognition accuracy with fewer distributions than conventional triphone HMMs.
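The efficiency argument above can be illustrated with a minimal sketch. Assume, purely for illustration, that two component decision trees condition on different context factors (say, left-phone and right-phone context), and that an output mean is the sum of its component means; the specific context sets and dimensionality below are hypothetical, not from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical component means produced by two separate decision trees,
# each tied over a different context factor (illustrative contexts only).
left_tree_means = {ctx: rng.standard_normal(13) for ctx in ["a", "i", "u"]}
right_tree_means = {ctx: rng.standard_normal(13) for ctx in ["a", "i", "u", "e"]}

def output_mean(left_ctx, right_ctx):
    # Additive structure: the full-context mean is the sum of the
    # component means selected by each tree.
    return left_tree_means[left_ctx] + right_tree_means[right_ctx]

# Storing 3 + 4 = 7 component vectors yields 3 * 4 = 12 distinct
# context-dependent output means.
all_means = {(l, r): output_mean(l, r)
             for l, r in itertools.product(left_tree_means, right_tree_means)}
print(len(all_means))  # 12 combinations from 7 stored components
```

The point is the combinatorics: the number of representable distributions grows multiplicatively in the component inventories while storage grows only additively, which is the source of the parameter savings claimed over conventional triphone models.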
