Lip-motion analysis for speech segmentation in noise

Abstract This paper explains how visual information from the lips can be combined with acoustic signals for speech segmentation. The psychological aspects of lip-reading and current automatic lip-reading systems are reviewed. The paper describes an image processing system that extracts the velocity of the lips from image sequences. Lip velocity is estimated by a combination of morphological image processing and block matching techniques, and the resulting velocity is used to locate syllable boundaries. This information is particularly useful when the speech signal is corrupted by noise. The paper also demonstrates the correlation between speech signals and lip information. Data fusion techniques are used to combine the acoustic and visual information for speech segmentation. The principal results show that combining visual and acoustic signals reduces segmentation errors by at least 10.4% when the signal-to-noise ratio is below 15 dB.
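The abstract's core image processing step, estimating lip velocity by block matching between consecutive frames, can be sketched in a few lines. The following is a minimal illustration of the general block-matching idea, not the authors' implementation: it finds the displacement of a reference block that minimises the sum of absolute differences (SAD) over a search window; dividing that displacement by the frame interval gives a velocity estimate. Frame contents, block size, and search range below are purely illustrative.

```python
def sad(block_a, block_b):
    """Sum of absolute pixel differences between two equal-sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def extract_block(frame, top, left, size):
    """Cut a size x size block out of a frame (list of pixel rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def block_match(prev_frame, next_frame, top, left, size, search):
    """Return the (dy, dx) displacement of the block at (top, left) in
    prev_frame that best matches next_frame within +/- search pixels.
    Velocity follows as displacement / frame interval."""
    ref = extract_block(prev_frame, top, left, size)
    best, best_cost = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidate positions that fall outside the frame.
            if (y < 0 or x < 0 or
                    y + size > len(next_frame) or
                    x + size > len(next_frame[0])):
                continue
            cost = sad(ref, extract_block(next_frame, y, x, size))
            if cost < best_cost:
                best_cost, best = cost, (dy, dx)
    return best
```

In a lip-tracking context, morphological processing would first isolate the mouth region so that the matched blocks lie on the lip contour; the per-frame displacements then form the velocity signal whose extrema indicate candidate syllable boundaries.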
