Paper information - Continuous visual speech recognition for multimodal fusion

Continuous visual speech recognition for multimodal fusion

It is widely accepted that human speech perception is a multimodal process that combines visual and acoustic information. In automatic speech recognition, visual analysis is also crucial because it provides complementary information that improves the performance of audio-only systems, especially in highly noisy environments. In this paper, we propose a unified probabilistic framework for speech unit recognition that combines visual and audio information. The method is based on the optimization of a criterion that achieves continuous speech unit segmentation and decoding using a learned joint phonetic-visemic model. Experiments conducted on the standard LIPS2008 dataset show a clear and consistent gain of our multimodal approach over the other approaches compared.
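The abstract does not state the criterion or the form of the joint phonetic-visemic model, so the following is only a generic illustration of the idea of segmenting and decoding in one optimization, not the paper's method. It assumes per-frame log-scores for each speech unit are already available from an audio model and a visual model, fuses them with a hypothetical weight `lam`, and uses dynamic programming to choose segment boundaries and unit labels together. The function names, the fusion rule, and the per-segment penalty `seg_penalty` are all assumptions made for this sketch.

```python
def fuse(audio_ll, visual_ll, lam=0.5):
    """Weighted log-linear fusion of per-frame log-scores (T frames x K units).

    lam is a hypothetical stream weight: 1.0 trusts audio only, 0.0 video only.
    """
    return [[lam * a + (1.0 - lam) * v for a, v in zip(row_a, row_v)]
            for row_a, row_v in zip(audio_ll, visual_ll)]


def segment_and_decode(scores, seg_penalty=1.0):
    """Jointly choose segment boundaries and unit labels by dynamic programming.

    Maximizes the sum of fused frame scores minus seg_penalty per segment.
    Returns a list of (start, end, unit) tuples with end exclusive.
    """
    T, K = len(scores), len(scores[0])
    # cum[t][k] = sum of scores[0..t-1][k], so a segment score is a difference.
    cum = [[0.0] * K]
    for row in scores:
        cum.append([c + x for c, x in zip(cum[-1], row)])

    best = [0.0] + [float("-inf")] * T   # best[t]: best score of frames 0..t-1
    back = [None] * (T + 1)              # back[t]: (segment start, unit label)
    for t in range(1, T + 1):
        for s in range(t):
            seg = [cum[t][k] - cum[s][k] for k in range(K)]
            k = max(range(K), key=seg.__getitem__)  # best unit for frames s..t-1
            cand = best[s] + seg[k] - seg_penalty
            if cand > best[t]:
                best[t], back[t] = cand, (s, k)

    # Trace back from the last frame to recover the segmentation.
    segments, t = [], T
    while t > 0:
        s, k = back[t]
        segments.append((s, t, k))
        t = s
    return segments[::-1]


# Toy example: 6 frames, 2 units. Audio is uninformative (as under heavy
# noise), while the visual stream favors unit 0 and then unit 1.
audio_ll = [[0.0, 0.0] for _ in range(6)]
visual_ll = [[0.0, -5.0]] * 3 + [[-5.0, 0.0]] * 3
segments = segment_and_decode(fuse(audio_ll, visual_ll, lam=0.5))
print(segments)
```

In the toy example the audio stream cannot distinguish the units, so the boundary and labels come entirely from the visual scores, which is the situation the abstract describes for noisy environments. The exhaustive search over segment starts is quadratic in the number of frames; a maximum segment length would be a natural restriction in practice.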

[1] H. McGurk, et al. Hearing lips and seeing voices, 1976, Nature.

[2] Chalapathy Neti, et al. Recent advances in the automatic recognition of audiovisual speech, 2003, Proc. IEEE.

[3] Alex Acero, et al. Spoken Language Processing, 2001.

[4] John Platt, et al. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods, 1999.

[5] Richard Bowden, et al. Learning temporal signatures for Lip Reading, 2011, 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).

[6] Shaogang Gong, et al. Audio- and Video-based Biometric Person Authentication, 1997, Lecture Notes in Computer Science.

[7] Matti Pietikäinen, et al. Lipreading: A Graph Embedding Approach, 2010, 2010 20th International Conference on Pattern Recognition.

[8] Jiri Matas, et al. XM2VTSDB: The Extended M2VTS Database, 1999.

[9] Naomi Harte, et al. Viseme definitions comparison for visual-only speech recognition, 2011, 2011 19th European Signal Processing Conference.

[10] Stephen J. Cox, et al. The challenge of multispeaker lip-reading, 2008, AVSP.

[11] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[12] Tuomas Virtanen, et al. Noise robust exemplar-based connected digit recognition, 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[13] Timothy F. Cootes, et al. Extraction of Visual Features for Lipreading, 2002, IEEE Trans. Pattern Anal. Mach. Intell.

[14] Matti Pietikäinen, et al. Lipreading with Local Spatiotemporal Descriptors, 2009, IEEE Transactions on Multimedia.

[15] Petros Maragos, et al. Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[16] Jon Barker, et al. An audio-visual corpus for speech perception and automatic speech recognition, 2006, The Journal of the Acoustical Society of America.

[17] Algirdas Pakstas, et al. MPEG-4 Facial Animation: The Standard, Implementation and Applications, 2002.

[18] Jean-Philippe Thiran, et al. Information Theoretic Feature Extraction for Audio-Visual Speech Recognition, 2009, IEEE Transactions on Signal Processing.

[19] A. Markides, et al. Speechreading (lipreading), 1979, Child: care, health and development.

[20] James R. Glass, et al. A segment-based audio-visual speech recognizer: data collection, development, and initial experiments, 2004, ICMI '04.

[21] Moshe Mahler, et al. Dynamic units of visual speech, 2012, SCA '12.

[22] Stanley F. Chen, et al. An empirical study of smoothing techniques for language modeling, 1999.

[23] Hichem Sahbi, et al. Designing relevant features for visual speech recognition, 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[24] Gérard Bailly, et al. LIPS2008: visual speech synthesis challenge, 2008, INTERSPEECH.

[25] Yoni Bauduin, et al. Audio-Visual Speech Recognition, 2004.