Lipreading by Neural Networks: Visual Preprocessing, Learning, and Sensory Integration

We have developed visual preprocessing algorithms for extracting phonologically relevant features from the grayscale video image of a speaker, to provide speaker-independent inputs for an automatic lipreading ("speechreading") system. Visual features such as mouth open/closed, tongue visible/not-visible, teeth visible/not-visible, and several shape descriptors of the mouth and its motion are all rapidly computable in a manner quite insensitive to lighting conditions. We formed a hybrid speechreading system consisting of two time-delay neural networks (video and acoustic) and integrated their responses by means of independent opinion pooling, the Bayes-optimal combination method given conditional independence, which appears to hold for our data. This hybrid system had an error rate 25% lower than that of the acoustic subsystem alone on a five-utterance speaker-independent task, indicating that video can be used to improve speech recognition.
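As a minimal sketch of independent opinion pooling (not the authors' implementation; the function name and example probabilities are illustrative), the two subsystems' class posteriors are multiplied together and divided by the class prior, then renormalized:

```python
import numpy as np

def independent_opinion_pool(p_acoustic, p_visual, prior=None):
    """Fuse two classifiers' posteriors under conditional independence:

        P(c | a, v)  is proportional to  P(c | a) * P(c | v) / P(c)
    """
    p_a = np.asarray(p_acoustic, dtype=float)
    p_v = np.asarray(p_visual, dtype=float)
    if prior is None:
        # Assume a uniform class prior, so the division only rescales.
        prior = np.full_like(p_a, 1.0 / p_a.size)
    fused = p_a * p_v / prior
    return fused / fused.sum()  # renormalize to a proper distribution

# The acoustic net mildly favors class 0; the video net disambiguates
# classes 0 and 1, so the pooled estimate is sharper than either alone.
p_a = [0.6, 0.3, 0.1]
p_v = [0.5, 0.1, 0.4]
print(independent_opinion_pool(p_a, p_v))  # class 0 dominates after pooling
```

Because the product rewards classes that both modalities support, a confident video cue can rescue an acoustically ambiguous utterance, which is the intuition behind the reported 25% error reduction.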
