Models for Audiovisual Fusion in a Noisy-Vowel Recognition Task

This paper presents a study of models for audiovisual (AV) fusion in a noisy-vowel recognition task. We progressively elaborate audiovisual models so that they respect the major principle demonstrated by human subjects in speech perception experiments (the “synergy” principle): audiovisual identification should always be more efficient than auditory-alone or visual-alone identification. We first recall that the efficiency of audiovisual speech recognition systems depends on the level at which they fuse sound and image: four AV architectures are presented, and two are selected for the remainder of the study. Second, we show the importance of providing the fusion process with a contextual input linked to the Signal-to-Noise Ratio (SNR). We then propose an original approach that uses an efficient nonlinear dimension reduction algorithm (“curvilinear component analysis”) to improve the performance of the two AV architectures. Furthermore, we show that this approach allows an easy and efficient estimation of the reliability of the audio sensor in relation to SNR, that this estimation can be used to control the AV fusion process, and that doing so significantly improves AV performance. Altogether, nonlinear dimension reduction, context estimation, and control of the fusion process enable us to respect the “synergy” criterion for the two most widely used architectures.
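To make the idea of controlling the fusion process by audio reliability concrete, the following is a minimal sketch of one common late-integration scheme, not the architectures evaluated in the paper. It assumes that each modality already outputs a posterior distribution over vowel classes, and that audio reliability can be mapped from SNR by a sigmoid. The function names, the sigmoid parameters (`midpoint_db`, `slope`), and the example posteriors are all hypothetical and chosen only for illustration.

```python
import math

VOWELS = ["a", "i", "u"]  # illustrative class set, not the paper's corpus


def audio_weight(snr_db, midpoint_db=0.0, slope=0.25):
    """Hypothetical sigmoid mapping from SNR (dB) to an audio reliability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse_posteriors(p_audio, p_visual, alpha):
    """Weighted geometric combination p_a**alpha * p_v**(1 - alpha), renormalized.

    alpha near 1 trusts the audio channel; alpha near 0 trusts the visual channel.
    """
    eps = 1e-12  # avoids zero scores when a posterior is exactly 0
    scores = [
        (pa + eps) ** alpha * (pv + eps) ** (1.0 - alpha)
        for pa, pv in zip(p_audio, p_visual)
    ]
    total = sum(scores)
    return [s / total for s in scores]


def decide(p_audio, p_visual, snr_db):
    """Return the vowel with the highest fused posterior at the given SNR."""
    fused = fuse_posteriors(p_audio, p_visual, audio_weight(snr_db))
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return VOWELS[best_index]


# Made-up posteriors on which the two modalities disagree.
p_audio = [0.5, 0.3, 0.2]
p_visual = [0.2, 0.7, 0.1]

clean_decision = decide(p_audio, p_visual, snr_db=30.0)   # audio dominates
noisy_decision = decide(p_audio, p_visual, snr_db=-30.0)  # visual dominates
```

With these assumed inputs, the decision follows the audio channel when the SNR is high and the visual channel when the SNR is low, which is the behaviour a reliability-controlled fusion is meant to produce. Whether such a scheme satisfies the “synergy” criterion depends on how well the reliability is estimated, which is the question the paper addresses.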
