Audiovisual speech processing

We have reported activities in audiovisual speech processing, with emphasis on lip reading and lip synchronization. These results show that lip reading can enhance the reliability of audio speech recognition, which may lead to computers that truly understand the user through hands-free natural spoken language, even in very noisy environments. Similarly, lip synchronization makes it possible to render realistic talking heads whose lip movements are synchronized with the voice, which is very useful for human-computer interaction. We envision that in the near future, advances in audiovisual speech processing will greatly increase the usability of computers. Once that happens, the camera and the microphone may replace the keyboard and the mouse as better mechanisms for human-computer interaction.
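To make the idea of enhancing audio speech recognition with lip reading concrete, here is a minimal sketch of one common approach, decision-level (late) fusion, in which per-word log-likelihoods from the audio and visual streams are combined with a reliability weight. The function name, the weight `lam`, and the toy scores are illustrative assumptions, not the specific method of any system reported here.

```python
import numpy as np

def fuse_scores(audio_loglik, visual_loglik, lam=0.7):
    """Decision-level fusion: combine per-word log-likelihoods from the
    audio and visual streams with a stream-reliability weight lam.
    Lowering lam shifts trust toward the visual stream, which is useful
    when the acoustic channel is noisy (lam and this API are illustrative)."""
    return lam * np.asarray(audio_loglik) + (1.0 - lam) * np.asarray(visual_loglik)

# Toy example: three candidate words scored by each modality.
audio = [-2.0, -1.5, -3.0]   # noisy audio weakly favors word 1
visual = [-3.0, -0.5, -2.5]  # lip features strongly favor word 1
best = int(np.argmax(fuse_scores(audio, visual, lam=0.5)))  # index of best word
```

In practice the weight can be adapted to the estimated signal-to-noise ratio, so the recognizer leans on the lips exactly when the audio becomes unreliable.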
