Incorporating phonetic knowledge into a multi-stream HMM framework

This paper presents a technique for improving the performance of multi-stream HMMs in ASR systems. In this technique, the stream exponents of the multi-stream model are chosen according to the phonological content of the underlying states. Two distinct feature sets, namely MFCCs and formant-like features, are used to investigate the potential of this technique. The experiments are performed on the AURORA database under the distributed speech recognition (DSR) framework. The proposed front-end constitutes an alternative to the DSR-XAFE (eXtended Audio Front-End) standardized by the European Telecommunications Standards Institute (ETSI). The proposed method yields up to a 10% relative improvement in word accuracy over the multi-stream model with tied exponents, and up to a 35% relative improvement over the state-of-the-art MFCC-based system.
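The core idea, combining per-stream emission log-likelihoods with exponents that depend on the phonetic class of the underlying state, can be sketched as follows. This is a minimal illustration under assumed settings, not the paper's implementation: the phonetic classes, exponent values, stream dimensionalities, and Gaussian parameters are all hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical phonetic classes and their stream exponents
# (gamma_mfcc, gamma_formant); exponents for each state sum to 1.
# Example assumption: formant cues are weighted higher for vowel states,
# the MFCC stream higher for fricatives.
PHONETIC_EXPONENTS = {
    "vowel":     (0.4, 0.6),
    "fricative": (0.7, 0.3),
    "silence":   (0.5, 0.5),
}

def stream_log_likelihood(obs, mean, var):
    """Diagonal-covariance Gaussian log-likelihood for one feature stream."""
    return multivariate_normal.logpdf(obs, mean=mean, cov=np.diag(var))

def combined_log_likelihood(obs_mfcc, obs_formant, state):
    """State emission score with phonetically chosen stream exponents:
    log b_j(o) = gamma_mfcc * log b_mfcc(o_mfcc) + gamma_formant * log b_formant(o_formant)."""
    g_mfcc, g_formant = PHONETIC_EXPONENTS[state["phonetic_class"]]
    ll_mfcc = stream_log_likelihood(obs_mfcc, state["mfcc_mean"], state["mfcc_var"])
    ll_formant = stream_log_likelihood(obs_formant, state["formant_mean"], state["formant_var"])
    return g_mfcc * ll_mfcc + g_formant * ll_formant

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy vowel state: 13-dim MFCC stream and 3-dim formant-like stream.
    state = {
        "phonetic_class": "vowel",
        "mfcc_mean": np.zeros(13),
        "mfcc_var": np.ones(13),
        "formant_mean": np.array([500.0, 1500.0, 2500.0]),
        "formant_var": np.array([100.0, 200.0, 300.0]) ** 2,
    }
    o_mfcc = rng.normal(size=13)
    o_formant = np.array([520.0, 1480.0, 2550.0])
    print(combined_log_likelihood(o_mfcc, o_formant, state))
```

In this sketch the exponents are looked up per phonetic class at decoding time; a model with tied exponents would instead use a single (gamma_mfcc, gamma_formant) pair for all states, which is the baseline the reported 10% relative improvement is measured against.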
