Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration

We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automatic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not-visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time-delay neural networks (video and acoustic) and integrated their responses by means of independent opinion pooling, the Bayes-optimal combination method given conditional independence, which appears to hold for our data. This hybrid system had an error rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.
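As a minimal sketch of independent opinion pooling (not the authors' implementation; the function name and example probabilities are illustrative), the two subsystems' class posteriors are multiplied together and divided by the class prior, then renormalized:

```python
import numpy as np

def independent_opinion_pool(p_acoustic, p_visual, prior=None):
    """Fuse two classifiers' posteriors under conditional independence:

        P(c | a, v)  is proportional to  P(c | a) * P(c | v) / P(c)
    """
    p_a = np.asarray(p_acoustic, dtype=float)
    p_v = np.asarray(p_visual, dtype=float)
    if prior is None:
        # Assume a uniform class prior, so the division only rescales.
        prior = np.full_like(p_a, 1.0 / p_a.size)
    fused = p_a * p_v / prior
    return fused / fused.sum()  # renormalize to a proper distribution

# The acoustic net mildly favors class 0; the video net disambiguates
# classes 0 and 1, so the pooled estimate is sharper than either alone.
p_a = [0.6, 0.3, 0.1]
p_v = [0.5, 0.1, 0.4]
print(independent_opinion_pool(p_a, p_v))  # class 0 dominates after pooling
```

Because the product rewards classes that both modalities support, a confident video cue can rescue an acoustically ambiguous utterance, which is the intuition behind the reported 25% error reduction.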
