Unified decoding and feature representation for improved speech recognition

In this paper we propose a unified framework for decoding and feature representation based on the maximum a posteriori (MAP) principle. The search space is augmented with an additional feature-stream dimension, so that different feature representations can be used for different phonetic contexts within the HMM decoding framework. We also provide a theoretical explanation for the unified framework. The framework yields “supervised” signal processing and feature extraction for the recognition system: when multiple feature streams are used simultaneously, it reduces the word recognition error rate by 15% on a large-vocabulary continuous speech recognition task.
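As an illustration only, the idea of adding a feature-stream dimension to the HMM search can be sketched as a Viterbi pass in which each state also selects the stream that scores best at each frame. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `viterbi_multistream`, the layout `log_emit[t][s][k]`, and the per-frame, per-state maximization over streams are all hypothetical choices made for this example, and the sketch assumes that scores from different streams have been made comparable.

```python
import math

NEG_INF = float("-inf")


def viterbi_multistream(log_init, log_trans, log_emit):
    """Viterbi search over (state, stream) pairs.

    log_init[s]       : log prior of starting in state s
    log_trans[p][s]   : log probability of moving from state p to state s
    log_emit[t][s][k] : log-likelihood of frame t in state s under stream k
                        (hypothetical layout, assumed comparable across streams)

    Returns the best log score and the best path as (state, stream) pairs.
    """
    n_frames = len(log_emit)
    n_states = len(log_init)

    def best_stream(t, s):
        # Extra search dimension: pick the best-scoring stream for this state.
        scores = log_emit[t][s]
        k = max(range(len(scores)), key=lambda i: scores[i])
        return scores[k], k

    delta = []   # delta[t][s]: best log score of any path ending in s at t
    back = []    # back[t][s]: predecessor state on that best path
    choice = []  # choice[t][s]: stream selected for state s at frame t

    row, ch = [], []
    for s in range(n_states):
        e, k = best_stream(0, s)
        row.append(log_init[s] + e)
        ch.append(k)
    delta.append(row)
    choice.append(ch)
    back.append([None] * n_states)

    for t in range(1, n_frames):
        row, ch, bp = [], [], []
        for s in range(n_states):
            prev = max(
                range(n_states),
                key=lambda p: delta[t - 1][p] + log_trans[p][s],
            )
            e, k = best_stream(t, s)
            row.append(delta[t - 1][prev] + log_trans[prev][s] + e)
            ch.append(k)
            bp.append(prev)
        delta.append(row)
        choice.append(ch)
        back.append(bp)

    # Backtrace from the best final state.
    s = max(range(n_states), key=lambda i: delta[-1][i])
    best_score = delta[-1][s]
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append((s, choice[t][s]))
        if t > 0:
            s = back[t][s]
    path.reverse()
    return best_score, path
```

The returned path records which stream was chosen for each decoded state, which is one way to picture different feature representations serving different phonetic contexts. In a real system the stream choice would more plausibly be tied to a phone or context rather than made independently per frame.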
