Statistical multimodal integration for audio-visual speech processing

Sensory information is indispensable to living things, which must integrate multiple senses to understand their surroundings. In human communication, people further integrate the modalities of audition and vision to understand intention. This paper focuses on speech-related modalities, since speech is the most important medium for conveying human intention. Although speech communication technologies have been studied extensively, their performance still leaves room for improvement: speech recognition, for example, has made remarkable progress, yet its accuracy degrades seriously in acoustically adverse environments. Perceptual research, meanwhile, has demonstrated that human perception complementarily integrates audio speech with visual face movements, a finding that has motivated attempts to exploit visual face information in speech recognition and synthesis. This paper introduces work on audio-visual speech recognition, speech-to-lip-movement mapping for audio-visual speech synthesis, and audio-visual speech translation.
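As a concrete illustration of the statistical integration behind audio-visual speech recognition, the sketch below shows the common multi-stream HMM formulation: the audio and visual streams contribute weighted log-likelihoods, log b_j(o_t) = λ log b_j(o_t^audio) + (1 − λ) log b_j(o_t^visual), and decoding runs over the fused scores. This is a minimal, hypothetical sketch, not the paper's implementation: the function names and toy inputs are ours, and real systems estimate the per-stream scores with Gaussian-mixture emissions and adapt the weight λ to acoustic conditions or stream confidence.

```python
import numpy as np

def fuse_streams(log_b_audio, log_b_visual, lam_audio):
    """Multi-stream HMM emission score (a sketch of the standard formulation):
    log b_j(o_t) = lam * log b_j(o_t^audio) + (1 - lam) * log b_j(o_t^visual).
    lam_audio is typically tuned to acoustic conditions, e.g. raised at high SNR."""
    return lam_audio * log_b_audio + (1.0 - lam_audio) * log_b_visual

def viterbi(log_b, log_pi, log_A):
    """Standard Viterbi decode over the fused per-frame, per-state scores.
    log_b: (T, N) fused emission log-likelihoods; log_pi: (N,) initial
    state log-probabilities; log_A: (N, N) transition log-probabilities."""
    T, N = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A              # scores[i, j]: prev i -> state j
        back[t] = np.argmax(scores, axis=0)          # best predecessor for each state
        delta = scores[back[t], np.arange(N)] + log_b[t]
    path = [int(np.argmax(delta))]                   # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# Toy usage: random stand-ins for the per-stream log-likelihoods that a real
# recognizer would compute from acoustic (e.g. MFCC) and visual (lip) features.
rng = np.random.default_rng(0)
T, N = 50, 3
log_b_a = rng.normal(size=(T, N))
log_b_v = rng.normal(size=(T, N))
log_pi = np.log(np.full(N, 1.0 / N))
log_A = np.log(np.full((N, N), 1.0 / N))
state_path = viterbi(fuse_streams(log_b_a, log_b_v, lam_audio=0.7), log_pi, log_A)
```

Setting lam_audio to 1.0 recovers an audio-only recognizer and 0.0 a pure lip-reader, which is why weight estimation and adaptation receive so much attention in the audio-visual fusion literature.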
