Adaptive fusion of acoustic and visual sources for automatic speech recognition

Among the various methods proposed to improve the accuracy and robustness of automatic speech recognition (ASR), the use of additional knowledge sources has proved successful. In particular, one recent approach supplements the acoustic information with visual data, mostly derived from the shape of the speaker's lips. Perceptual studies support this approach by emphasising the importance of visual information for speech recognition in humans. This paper describes a method we have developed for the adaptive integration of acoustic and visual information in ASR. Each modality contributes to the recognition process with a different weight, which is dynamically adapted during recognition, mainly according to the signal-to-noise ratio provided as a contextual input. We tested this method on continuous hidden Markov model-based systems built according to direct identification (DI), separate identification (SI) and hybrid identification (DI + SI) strategies. Experiments performed under various noise-level conditions show that the DI + SI-based system is the most promising of the three for a speaker-dependent continuous letter-spelling recognition task in French. They also confirm that using adaptive rather than fixed modality weights improves performance, and that weight estimation could benefit from using visemes as the decision units of the visual recogniser in the SI- and DI + SI-based systems.
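As a minimal sketch of the SNR-driven weighting described above (not the authors' implementation: the linear weight mapping, the function names and the toy scores are illustrative assumptions), per-hypothesis log-likelihoods from independent acoustic and visual recognisers can be combined with an SNR-dependent weight, so that the fused decision follows the acoustic stream in clean conditions and shifts towards the visual stream in noise:

def snr_to_weight(snr_db, snr_min=-5.0, snr_max=25.0):
    # Hypothetical mapping from the estimated SNR (dB) to the acoustic weight in [0, 1]:
    # high SNR -> rely mostly on the acoustic stream, low SNR -> shift weight to the visual stream.
    lam = (snr_db - snr_min) / (snr_max - snr_min)
    return max(0.0, min(1.0, lam))

def fuse_scores(log_p_acoustic, log_p_visual, snr_db):
    # Separate-identification style fusion: weighted combination of the
    # log-likelihoods produced by the two independent recognisers.
    lam = snr_to_weight(snr_db)
    return lam * log_p_acoustic + (1.0 - lam) * log_p_visual

# Toy example: the acoustic recogniser favours "p", the visual recogniser favours "b".
# The fused decision follows the acoustic stream at high SNR and the visual stream at low SNR.
hypotheses = {"p": (-2.0, -9.0), "b": (-4.5, -2.0)}  # (acoustic, visual) log-likelihoods
for snr in (20.0, 0.0):
    best = max(hypotheses, key=lambda h: fuse_scores(*hypotheses[h], snr))
    print(f"SNR = {snr:5.1f} dB -> best hypothesis: {best}")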