Adaptive Decision Fusion for Audio-Visual Speech Recognition

While automatic speech recognition (ASR) technologies have been successfully applied to real-world applications, several problems must still be solved before the technologies can be deployed more widely. One such problem is the noise-robustness of recognition performance: although a speech recognition system can achieve high accuracy in quiet conditions, its performance tends to degrade significantly in the presence of background noise, which is nearly inevitable in most real-world applications. Recently, audio-visual speech recognition (AVSR), in which visual speech information (i.e., lip movements) is used together with acoustic information for recognition, has received attention as a solution to this problem. Since the visual signal is not affected by acoustic noise, it is a powerful source for compensating for the performance degradation of acoustic-only speech recognition in noisy conditions. Figure 1 shows the general procedure of AVSR: first, the acoustic and visual signals are recorded by a microphone and a camera, respectively; then, salient and compact features are extracted from each signal; finally, the two modalities are integrated to recognize the given speech.
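To make the integration step concrete, the following is a minimal sketch of decision-level (late) fusion, assuming each modality has its own separately trained classifier that outputs per-class log-likelihoods; the function names, the fixed three-word vocabulary, and the score values are illustrative, not taken from the paper, and the weighting here is a simple hand-set interpolation rather than the paper's adaptive scheme.

```python
import numpy as np

def fuse_scores(audio_loglik, visual_loglik, audio_weight):
    """Late (decision-level) fusion of audio and visual classifier outputs.

    audio_loglik, visual_loglik: arrays of shape (num_classes,), the per-class
        log-likelihoods from the acoustic and visual recognizers.
    audio_weight: interpolation weight in [0, 1]; set high in quiet conditions
        and lower as acoustic noise increases, so the reliable modality dominates.
    Returns the index of the winning class.
    """
    fused = audio_weight * audio_loglik + (1.0 - audio_weight) * visual_loglik
    return int(np.argmax(fused))

# Illustrative usage with made-up scores for a 3-word vocabulary.
audio_ll = np.array([-12.0, -9.5, -11.1])   # acoustic recognizer scores (noisy)
visual_ll = np.array([-8.2, -10.4, -7.9])   # visual recognizer scores
word = fuse_scores(audio_ll, visual_ll, audio_weight=0.3)  # noisy audio: trust video more
print(word)  # -> 2, the class favored by the visual stream
```

An adaptive fusion scheme would replace the hand-set `audio_weight` with a value estimated from stream-reliability measures (e.g., an estimate of the acoustic SNR or classifier confidence), so the weight tracks the noise conditions at recognition time.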
