Extraction of Visual Features for Lipreading

The multimodal nature of speech is often ignored in human-computer interaction, but lip deformations and other body motion, such as head movement, convey additional information. Humans integrate speech cues from many sources, and this improves intelligibility, especially when the acoustic signal is degraded. This paper shows how this additional, often complementary, visual speech information can be used for speech recognition. Three methods for parameterizing lip image sequences for recognition using hidden Markov models are compared. Two are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or of shape and appearance, respectively. The third, a bottom-up method, uses a nonlinear scale-space analysis to form features directly from the pixel intensity. All three methods are compared on a multitalker visual speech recognition task of isolated letters.
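
As a rough illustration of the shape-based, top-down idea, the sketch below runs a principal component analysis on flattened lip contour coordinates and uses the projections onto the leading modes as per-frame features. It is not the paper's implementation: the function names (fit_shape_pca, shape_features, reconstruct, synthetic_lip_shapes), the elliptical synthetic contours, and the assumption that the contours are already located and aligned are all hypothetical choices made for this example.

```python
import numpy as np


def fit_shape_pca(shapes, n_modes):
    """Fit PCA to contour shapes.

    shapes: (n_frames, 2 * n_points) array, each row a flattened contour
            (all x coordinates, then all y coordinates), assumed aligned.
    Returns the mean shape, the first n_modes principal modes (as rows),
    and the variance explained by each of those modes.
    """
    mean = shapes.mean(axis=0)
    centered = shapes - mean
    # Rows of vt are the principal directions of the centered data.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    modes = vt[:n_modes]
    variances = (s[:n_modes] ** 2) / (shapes.shape[0] - 1)
    return mean, modes, variances


def shape_features(shape, mean, modes):
    """Project one flattened contour onto the retained modes."""
    return modes @ (shape - mean)


def reconstruct(features, mean, modes):
    """Approximate a contour from its low-dimensional features."""
    return mean + modes.T @ features


def synthetic_lip_shapes(n_frames=50, n_points=20, seed=0):
    """Hypothetical stand-in data: ellipses of fixed width whose height
    (mouth opening) varies from frame to frame."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    openings = rng.uniform(0.2, 1.0, size=n_frames)
    x = np.tile(np.cos(theta), (n_frames, 1))
    y = openings[:, None] * np.sin(theta)[None, :]
    return np.hstack([x, y]), openings


if __name__ == "__main__":
    shapes, openings = synthetic_lip_shapes()
    mean, modes, variances = fit_shape_pca(shapes, n_modes=1)
    features = np.array([shape_features(s, mean, modes) for s in shapes])
    print("feature matrix shape:", features.shape)
    print("variance of first mode:", variances[0])
```

In this toy data only the mouth opening varies, so a single mode describes every contour. Real lip contours would need more modes, and the resulting feature vectors would be passed, frame by frame, to a sequence model such as a hidden Markov model.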
