Sensory integration in audiovisual automatic speech recognition

Methods of integrating audio and visual information in an audiovisual HMM-based automatic speech recognition (ASR) system are investigated. Experiments involve discrimination of a set of 22 consonants under various integration strategies. The role of the visual subsystem is varied: in one run, the subsystem attempts to classify all 22 consonants, while in other runs it attempts only broader classifications. In a second experiment, a new HMM formulation is employed that incorporates the integration into the HMM at a pre-categorical stage. A single variable parameter controls the relative contribution of the audio and visual information. This form of integration can be incorporated very easily into existing audio-based continuous speech recognizers.
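
The pre-categorical integration described above is commonly realized as a multi-stream HMM in which each state scores the audio and visual observations separately and combines the two log-likelihoods with a single weight. The sketch below illustrates that general idea; it is not the paper's exact formulation, and the names (lam, audio_gauss, visual_gauss) are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def combined_log_likelihood(o_audio, o_visual, audio_gauss, visual_gauss, lam):
    """Weighted per-frame log-likelihood of one HMM state.

    o_audio, o_visual : feature vectors for the audio and visual streams
    audio_gauss, visual_gauss : (mean, cov) of the state's output densities
    lam : relative contribution of the audio stream (1 - lam for the visual stream)
    """
    log_a = multivariate_normal.logpdf(o_audio, *audio_gauss)
    log_v = multivariate_normal.logpdf(o_visual, *visual_gauss)
    # A single parameter lam in [0, 1] controls the audio/visual balance.
    return lam * log_a + (1.0 - lam) * log_v

# Example: a 2-D audio stream and a 1-D visual stream scored by one state.
audio_gauss = (np.zeros(2), np.eye(2))
visual_gauss = (np.zeros(1), np.eye(1))
score = combined_log_likelihood(np.array([0.1, -0.2]), np.array([0.05]),
                                audio_gauss, visual_gauss, lam=0.7)
print(score)
```

Because the combination happens inside the state's observation score, an existing audio-only decoder can adopt it by replacing the state likelihood computation alone, which is why this style of integration drops easily into existing continuous speech recognizers.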
