Transition-based feature extraction within frame-based recognition

Recognition Zhihong Hu, Etienne Barnard, Ronald Cole ( zhihong,barnard,cole)@cse.ogi.edu Center for Spoken Language Understanding, Oregon Graduate Institute of Science and Technology Abstract Current frame-based speech recognition systems sample speech at a xed set of locations relative to each frame. Modeling the temporal dynamic behavior of speech is thereby complicated. This work shows that by explicitly using transitional information when extracting features, one can better model the acoustic phonetic structure, resulting in higher word level recognition performance. In this proposed approach, features representing local transitional information are used (a constant number of features are selected at each time frame, but the features are sampled near areas of greatest spectrum change within a relatively long window.) By explicitly modeling transitions in this way, we can also model local contextual information. Using this technique, the word level error rate decreased up to 30% on the databases we tested.

[1]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[2]  Oded Ghitza,et al.  Hidden Markov models with templates as non-stationary states: an application to speech recognition , 1993, Comput. Speech Lang..

[3]  Steven Greenberg,et al.  Stochastic perceptual models of speech , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[4]  James R. Glass,et al.  Statistical trajectory models for phonetic recognition , 1994, ICSLP.

[5]  Ronald A. Cole,et al.  English alphabet recognition with telephone speech , 1991, EUROSPEECH.

[6]  H Hermansky,et al.  Perceptual linear predictive (PLP) analysis of speech. , 1990, The Journal of the Acoustical Society of America.

[7]  S. Furui On the role of spectral transition for speech perception. , 1986, The Journal of the Acoustical Society of America.

[8]  Michael Picheny,et al.  Robust methods for using context-dependent features and models in a continuous speech recognizer , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.