Exploiting sensor fusion architectures and stimuli complementarity in AV speech recognition

The aim of this paper is to strengthen the bridge between automatic audiovisual (AV) speech recognition and models from cognitive psychology. To this end, it is necessary to better define and exploit the possible architectures for sensor fusion, and to better characterize the content of auditory (A) and visual (V) speech stimuli. We define four models organized around three basic questions about AV speech perception, and show that most recognition systems are based on only two of these models, ignoring the one that is most compatible with experimental data. We then present a series of new experimental data showing the deep complementarity of the A and V sensors, in both the configurational and the temporal domains, and the optimal use of this complementarity by the human AV fusion system. We submit the four models to a benchmark test on the identification of French vowels in noise, and show that only three of them exploit the AV complementarity well. Finally, we propose a general architecture for processing AV speech, the Timing-Target Model of speech perception, and present implementation elements for some of its constituent modules.