Continuous visual speech recognition for multimodal fusion

It is well established that human speech perception is a multimodal process that combines both visual and acoustic information. In automatic speech recognition, visual analysis is also valuable, as it provides complementary information that enhances the performance of audio-only systems, especially in highly noisy environments. In this paper, we propose a unified probabilistic framework for speech unit recognition that combines visual and audio information. The method is based on the optimization of a criterion that achieves continuous speech unit segmentation and decoding using a learned joint phonetic-visemic model. Experiments conducted on the standard LIPS2008 dataset show a clear and consistent gain of our multimodal approach compared to other approaches.
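
As a rough illustration of the kind of decoding the abstract describes, the sketch below (Python/NumPy) combines per-frame acoustic and visual log-likelihoods log-linearly with a stream weight and runs a Viterbi pass over speech-unit states, so that the best path yields a joint segmentation and decoding of the utterance. This is only one plausible instantiation of such a criterion, not the paper's actual method; all names (viterbi_av, lam, etc.) are hypothetical.

    import numpy as np

    def viterbi_av(log_audio, log_visual, log_trans, log_prior, lam=0.7):
        """Fuse audio/visual scores and decode one utterance.

        log_audio, log_visual : (T, S) per-frame log-likelihoods of each
            of the S speech-unit states under the acoustic/visual models.
        log_trans : (S, S) state-transition log-probabilities.
        log_prior : (S,) initial-state log-probabilities.
        lam : audio stream weight in [0, 1]; lower it when audio is noisy.
        """
        # Log-linear fusion of the two streams (an assumed, common choice).
        fused = lam * log_audio + (1.0 - lam) * log_visual  # (T, S)
        T, S = fused.shape
        delta = log_prior + fused[0]          # best score ending in each state
        back = np.zeros((T, S), dtype=int)    # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # scores[i, j]: prev i -> cur j
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + fused[t]
        # Backtrack the best state path: the frame-level labels give both the
        # segmentation (state boundaries) and the decoded unit sequence.
        path = np.empty(T, dtype=int)
        path[-1] = int(delta.argmax())
        for t in range(T - 1, 0, -1):
            path[t - 1] = back[t, path[t]]
        return path

    # Toy usage with random scores: 40 frames, 8 speech-unit states.
    rng = np.random.default_rng(0)
    la = np.log(rng.dirichlet(np.ones(8), size=40))  # fake audio likelihoods
    lv = np.log(rng.dirichlet(np.ones(8), size=40))  # fake visual likelihoods
    lt = np.log(rng.dirichlet(np.ones(8), size=8))   # fake transition matrix
    lp = np.log(np.full(8, 1.0 / 8))                 # uniform initial states
    print(viterbi_av(la, lv, lt, lp, lam=0.5))

In this formulation, the stream weight plays the role the abstract assigns to fusion: as acoustic noise increases, shifting weight toward the visual stream preserves recognition accuracy.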

[1] H. McGurk, et al. Hearing lips and seeing voices, 1976, Nature.

[2] Chalapathy Neti, et al. Recent advances in the automatic recognition of audiovisual speech, 2003, Proc. IEEE.

[3] Alex Acero, et al. Spoken Language Processing, 2001.

[4] John Platt, et al. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, 1999.

[5] Richard Bowden, et al. Learning temporal signatures for lip reading, 2011, IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[6] Shaogang Gong, et al. Audio- and Video-based Biometric Person Authentication, 1997, Lecture Notes in Computer Science.

[7] Matti Pietikäinen, et al. Lipreading: A Graph Embedding Approach, 2010, 20th International Conference on Pattern Recognition.

[8] Jiri Matas, et al. XM2VTSDB: The Extended M2VTS Database, 1999.

[9] Naomi Harte, et al. Viseme definitions comparison for visual-only speech recognition, 2011, 19th European Signal Processing Conference.

[10] Stephen J. Cox, et al. The challenge of multispeaker lip-reading, 2008, AVSP.

[11] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[12] Tuomas Virtanen, et al. Noise robust exemplar-based connected digit recognition, 2010, IEEE International Conference on Acoustics, Speech and Signal Processing.

[13] Timothy F. Cootes, et al. Extraction of Visual Features for Lipreading, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[14] Matti Pietikäinen, et al. Lipreading with Local Spatiotemporal Descriptors, 2009, IEEE Transactions on Multimedia.

[15] Petros Maragos, et al. Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[16] Jon Barker, et al. An audio-visual corpus for speech perception and automatic speech recognition, 2006, The Journal of the Acoustical Society of America.

[17] Algirdas Pakstas, et al. MPEG-4 Facial Animation: The Standard, Implementation and Applications, 2002.

[18] Jean-Philippe Thiran, et al. Information Theoretic Feature Extraction for Audio-Visual Speech Recognition, 2009, IEEE Transactions on Signal Processing.

[19] A. Markides, et al. Speechreading (lipreading), 1979, Child: Care, Health and Development.

[20] James R. Glass, et al. A segment-based audio-visual speech recognizer: data collection, development, and initial experiments, 2004, ICMI '04.

[21] Moshe Mahler, et al. Dynamic units of visual speech, 2012, SCA '12.

[22] Stanley F. Chen, et al. An empirical study of smoothing techniques for language modeling, 1999.

[23] Hichem Sahbi, et al. Designing relevant features for visual speech recognition, 2013, IEEE International Conference on Acoustics, Speech and Signal Processing.

[24] Gérard Bailly, et al. LIPS2008: visual speech synthesis challenge, 2008, INTERSPEECH.

[25] Yoni Bauduin, et al. Audio-Visual Speech Recognition, 2004.